This Python script identifies internal linking opportunities within a list of website URLs by searching for unlinked mentions of specified keywords. It fetches the content of the provided site URLs, tokenizes the text into sentences, and searches for sentences that contain the target keywords but do not already contain a link. If such sentences are found, they are recorded as potential internal link opportunities.
- Fetches content from a list of site URLs using the Jina AI API.
- Uses NLTK for sentence tokenization.
- Identifies sentences containing unlinked mentions of target keywords.
- Skips sentences that already contain links or are headings/formatting.
- Normalizes URLs to avoid linking to the same page.
- Handles errors during content fetching and continues processing.
- Python 3.x
- Required Python packages:
  - pandas
  - requests
  - nltk
- An API key for the Jina AI API.
Required Python packages:
pip install pandas requests nltk
Download NLTK data:
The script automatically downloads the required NLTK data (punkt) if it is not already installed. Alternatively, you can download it manually:
import nltk
nltk.download('punkt')
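The automatic check described above could look roughly like the following. This is a minimal sketch, not the script's actual code: the function name ensure_punkt and its nltk_module parameter are hypothetical and exist only so the logic can be exercised without a real download. It relies on nltk.data.find raising LookupError when a resource is missing.

```python
def ensure_punkt(nltk_module=None):
    """Download the punkt tokenizer data only if it is missing.

    `nltk_module` is a hypothetical injection point for testing; when it is
    omitted, the real nltk package is imported.
    Returns True if a download was triggered, False otherwise.
    """
    if nltk_module is None:
        # Imported lazily so this sketch can be loaded without nltk installed.
        import nltk as nltk_module
    try:
        # nltk.data.find raises LookupError when the resource is absent.
        nltk_module.data.find("tokenizers/punkt")
        return False
    except LookupError:
        nltk_module.download("punkt")
        return True
```

Calling ensure_punkt() once at startup, before any sentence tokenization, would make the manual download step optional.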
- target_keywords.csv: A CSV file containing the target URLs and their associated keywords. Each line should contain a target URL and the keyword to search for, separated by a comma. No headers are required.
  Format:
  target_url,keyword
  Example:
  https://example.com/pageA,keyword1
  https://example.com/pageB,keyword2
  https://example.com/pageC,keyword3
  Note: The first column is the target URL, and the second column is the keyword.
- site_urls.csv: A CSV file containing the site URLs to scrape for internal linking opportunities. Each URL should be on a separate line, and no headers are required.
  Example:
  https://example.com/page1
  https://example.com/page2
  https://example.com/page3
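To make the two input formats concrete, here is a small sketch of how headerless files like these could be parsed. The script itself lists pandas as a dependency; this example uses the standard-library csv module instead so that it runs without extra packages. The function names read_target_keywords and read_site_urls are hypothetical.

```python
import csv
import io


def read_target_keywords(handle):
    """Return (target_url, keyword) pairs from a headerless two-column CSV."""
    pairs = []
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs


def read_site_urls(handle):
    """Return the URLs from a headerless one-column CSV, one per line."""
    urls = []
    for row in csv.reader(handle):
        if row and row[0].strip():
            urls.append(row[0].strip())
    return urls


# In-memory stand-ins for target_keywords.csv and site_urls.csv.
sample_keywords = io.StringIO(
    "https://example.com/pageA,keyword1\n"
    "https://example.com/pageB,keyword2\n"
)
sample_sites = io.StringIO(
    "https://example.com/page1\n"
    "\n"
    "https://example.com/page2\n"
)

targets = read_target_keywords(sample_keywords)
sites = read_site_urls(sample_sites)
```

With real files, the same functions accept a handle opened with open("target_keywords.csv", newline=""). A pandas equivalent would be pd.read_csv with header=None.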
The script requires an API key for the Jina AI API to fetch the content of the URLs. You need to obtain an API key from Jina AI.
- Open the script file in a text editor.
- Replace the placeholder {your Jina.ai API key here} with your actual API key:
  # Set your API key
  api_key = '{your Jina.ai API key here}'
  Be sure to remove the braces {} when entering your API key.
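For orientation, this is roughly how a key set this way might be used in a request. It is a sketch under assumptions: this README does not name the endpoint, so the Jina AI Reader URL prefix (https://r.jina.ai/) and the Bearer authorization header are assumed here, and the function names are hypothetical.

```python
# Assumption: content is fetched through the Jina AI Reader endpoint,
# which takes the page URL appended to this prefix.
READER_ENDPOINT = "https://r.jina.ai/"


def build_reader_request(page_url, api_key):
    """Return the request URL and headers for fetching one page."""
    request_url = READER_ENDPOINT + page_url
    headers = {"Authorization": f"Bearer {api_key}"}
    return request_url, headers


def fetch_content(page_url, api_key, timeout=30):
    """Fetch the page text; raises on network errors or non-2xx responses."""
    import requests  # imported lazily so the builder above works without it

    request_url, headers = build_reader_request(page_url, api_key)
    response = requests.get(request_url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text
```

Keeping the URL and header construction separate from the network call makes it easy to check that the key is being passed without the surrounding braces.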
Ensure that the input files (site_urls.csv and target_keywords.csv) are in the same directory as the script.
Run the script using the following command:
python internal_link_optimizer.py
The script will process each URL, fetch the content, and search for unlinked mentions of the keywords.
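The processing loop can be pictured along these lines. This is a simplified sketch, not the script's implementation: it uses a naive regex sentence splitter as a stand-in for NLTK tokenization, assumes case-insensitive whole-word keyword matching, and leaves out the sentence-skipping and same-page checks. The names find_mentions and process_sites, and the fetch callable, are hypothetical.

```python
import re


def find_mentions(source_url, text, targets):
    """Return (source_url, sentence, keyword, target_url) rows for keyword hits.

    `targets` is a list of (target_url, keyword) pairs.
    """
    rows = []
    # Naive splitter used here instead of NLTK's sentence tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sentence in sentences:
        for target_url, keyword in targets:
            pattern = r"\b" + re.escape(keyword) + r"\b"
            # Assumption: matching is case-insensitive and whole-word.
            if re.search(pattern, sentence, re.IGNORECASE):
                rows.append((source_url, sentence, keyword, target_url))
    return rows


def process_sites(site_urls, targets, fetch):
    """Fetch each URL with `fetch`, recording errors instead of stopping."""
    opportunities = []
    errors = []
    for url in site_urls:
        try:
            text = fetch(url)
        except Exception as exc:  # record the failure and move on
            errors.append((url, str(exc)))
            continue
        opportunities.extend(find_mentions(url, text, targets))
    return opportunities, errors
```

Passing the fetch function in as a parameter keeps the loop testable with canned text, and a failed fetch only adds an entry to errors rather than aborting the run.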
The script generates the following output files:
- content.csv: A CSV file containing the source URLs and their fetched body texts.
  Format:
  source_url,body_text
- unlinked_keywords.csv: A CSV file containing the internal link opportunities found. Each row contains:
  - Source URL: The URL where the unlinked keyword was found.
  - Sentence: The sentence containing the unlinked keyword.
  - Keyword: The keyword found.
  - Target URL: The target URL associated with the keyword.
  Format:
  Source URL,Sentence,Keyword,Target URL
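As an illustration of the unlinked_keywords.csv layout, the sketch below writes rows in that column order using the standard-library csv module. It assumes the format line above is emitted as a header row, which the README does not state explicitly. The function name write_opportunities is hypothetical.

```python
import csv
import io


def write_opportunities(handle, rows):
    """Write opportunity rows in the unlinked_keywords.csv column order."""
    writer = csv.writer(handle)
    # Assumption: the file starts with this header row.
    writer.writerow(["Source URL", "Sentence", "Keyword", "Target URL"])
    writer.writerows(rows)


# In-memory stand-in for the output file.
buffer = io.StringIO()
write_opportunities(
    buffer,
    [
        (
            "https://example.com/page1",
            "We sell keyword1, and more.",
            "keyword1",
            "https://example.com/pageA",
        )
    ],
)
```

Using a CSV writer rather than joining fields by hand matters here because sentences often contain commas, which the writer quotes automatically.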
- Sentence Skipping Criteria:
  The script skips sentences that:
  - Do not end with punctuation (e.g., ., ?, !).
  - Are headings or formatted text (e.g., start and end with #, **, or *).
  - Already contain links (identified by markdown link syntax [text](url)).
- URL Normalization:
  The script normalizes URLs to:
  - Ensure consistent comparison between source and target URLs.
  - Avoid suggesting links to the same page.
  - Handle cases where URLs have different schemes (http vs. https), www prefixes, or trailing slashes.
- Error Handling:
  If an error occurs while fetching the content of a URL (e.g., network issues, invalid URLs), the script records the error and continues processing the remaining URLs.
- NLTK Data Download:
  The script checks whether the required NLTK data (punkt) is downloaded. If not, it automatically downloads it at runtime.
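The sentence-skipping and URL-normalization rules in the notes above can be sketched as two small helpers. These are one possible reading of the rules, not the script's actual code: the heading check takes "start and end with" literally, and normalize_url additionally drops query strings and fragments, which is an assumption. All function names are hypothetical.

```python
import re
from urllib.parse import urlparse

# Markdown link syntax: [text](url)
MARKDOWN_LINK = re.compile(r"\[[^\]]*\]\([^)]*\)")


def should_skip(sentence):
    """Return True if the sentence should not be suggested as an opportunity."""
    s = sentence.strip()
    if not s:
        return True
    # Headings or formatted text: starts and ends with #, ** or *.
    for marker in ("#", "**", "*"):
        if len(s) > len(marker) and s.startswith(marker) and s.endswith(marker):
            return True
    # Already contains a markdown link.
    if MARKDOWN_LINK.search(s):
        return True
    # Does not end with sentence punctuation.
    return s[-1] not in ".?!"


def normalize_url(url):
    """Reduce a URL to host + path, ignoring scheme, www prefix and trailing slash."""
    parsed = urlparse(url.strip())
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    # Assumption: query strings and fragments are ignored when comparing pages.
    return host + parsed.path.rstrip("/")


def is_same_page(source_url, target_url):
    """True when both URLs normalize to the same page."""
    return normalize_url(source_url) == normalize_url(target_url)
```

A caller would drop any sentence for which should_skip returns True, and any keyword whose target URL is_same_page as the source URL, before recording an opportunity.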