@commoncrawl

Common Crawl Foundation

Common Crawl provides an archive of webpages going back to 2007.

Pinned

  1. cc-pyspark Public

    Process Common Crawl data with Python and Spark

    Python · 448 stars · 90 forks

  2. cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python · 203 stars · 16 forks

  3. cc-index-table Public

    Index Common Crawl archives in tabular format

    Java · 124 stars · 14 forks

  4. cc-warc-examples Public

    Forked from Smerity/cc-warc-examples

    CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop

    Java · 37 stars · 18 forks

  5. cc-citations Public

    Scientific articles using or citing Common Crawl data

    Jupyter Notebook · 27 stars · 5 forks

  6. cc-notebooks Public

    Various Jupyter notebooks about Common Crawl data

    Jupyter Notebook · 60 stars · 10 forks

Repositories

Showing 10 of 75 repositories
  • cc-crawl-statistics Public

    Statistics of Common Crawl monthly archives mined from URL index files

    Python · 203 stars · Apache-2.0 · 16 forks · 2 issues · 0 pull requests · Updated Dec 3, 2025
  • nutch Public Forked from Aloisius/nutch

    Common Crawl fork of Apache Nutch

    Java · 39 stars · Apache-2.0 · 1,268 forks · 6 issues (1 issue needs help) · 0 pull requests · Updated Dec 3, 2025
  • crawler-commons Public Forked from crawler-commons/crawler-commons

    A set of reusable Java components that implement functionality common to any web crawler

    Java · 2 stars · Apache-2.0 · 91 forks · 0 issues · 2 pull requests · Updated Dec 2, 2025
  • cc-citations Public

    Scientific articles using or citing Common Crawl data

    Jupyter Notebook · 27 stars · 5 forks · 0 issues · 0 pull requests · Updated Dec 1, 2025
  • web-languages Public

    Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.

    65 stars · 84 forks · 3 issues · 1 pull request · Updated Nov 25, 2025
  • cc-webgraph-statistics Public

    Statistics of Common Crawl monthly Web Graphs

    Python · 5 stars · Apache-2.0 · 1 fork · 2 issues · 0 pull requests · Updated Nov 24, 2025
  • cc-webgraph Public

    Tools to construct and process Common Crawl webgraphs

    Java · 102 stars · Apache-2.0 · 4 forks · 2 issues (1 issue needs help) · 0 pull requests · Updated Nov 23, 2025
  • cc-notebooks Public

    Various Jupyter notebooks about Common Crawl data

    Jupyter Notebook · 60 stars · Apache-2.0 · 10 forks · 0 issues · 0 pull requests · Updated Nov 22, 2025
  • warcio-s3 Public Forked from webrecorder/warcio

    Streaming WARC/ARC library for fast web archive IO

    Python · 0 stars · Apache-2.0 · 65 forks · 0 issues · 1 pull request · Updated Nov 21, 2025
  • whirlwind-python Public

    A whirlwind tour of Common Crawl's data using Python

    Python · 29 stars · Apache-2.0 · 6 forks · 0 issues · 0 pull requests · Updated Nov 20, 2025