Scrapito

A templated, easy-to-use scraping tool.

Usage

Template Basics

Here is an example of an empty template.

example.yaml:

name: ExampleTemplate # Required
timeout: 3600000 # Default value (ms)
renderJS: false # Default value
maxThreads: 10 # Default value
maxRetries: 2 # Default value
version: 1 # Required, needs to be === 1

params: # TODO:
  - query: !!string
  - page: !!number

# Contains scraping flow
pipelines: []
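
The defaults listed in the comments above can be pictured as a small normalization step. The sketch below is illustrative only: the `TemplateInput` and `Template` types and the `withDefaults` function are hypothetical names, not Scrapito's API. Only the field names and default values come from the template above.

```typescript
// Hypothetical types: only the field names and defaults come from example.yaml.
interface TemplateInput {
  name: string;
  version: number;
  timeout?: number;
  renderJS?: boolean;
  maxThreads?: number;
  maxRetries?: number;
  pipelines?: unknown[];
}

type Template = Required<TemplateInput>;

// Check the two required fields, then fill in the documented defaults.
function withDefaults(input: TemplateInput): Template {
  if (!input.name) {
    throw new Error("name is required");
  }
  if (input.version !== 1) {
    throw new Error("version is required and must be 1");
  }
  return {
    name: input.name,
    version: input.version,
    timeout: input.timeout ?? 3600000, // milliseconds
    renderJS: input.renderJS ?? false,
    maxThreads: input.maxThreads ?? 10,
    maxRetries: input.maxRetries ?? 2,
    pipelines: input.pipelines ?? [],
  };
}
```

Under these assumptions, `withDefaults({ name: "ExampleTemplate", version: 1 })` yields a template equivalent to example.yaml.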

Pipelines

Each pipeline represents a group of one or more selectors. Pipelines are used to group results.

Parameters

name:

Describes the content. The name is also the key under which the pipeline's results appear in the output.

name = string

url:

Used to target one or more web pages.

url ?= string | string[]

selector:

Used to target one or more HTML elements.

Uses the querySelector syntax (CSS selectors).

selector = string

attribute:

Used to extract information from HTML elements.

attribute ?= string

transform:

Used to transform the result. Internally, the string is treated as the body of a function that takes a res parameter:

eval("(res: any) => " + transform)

transform ?= string
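
To make the idea concrete, here is a minimal sketch of how a transform string could become a callable function. `compileTransform` is a hypothetical helper, not Scrapito's implementation, and it uses `new Function` rather than the `eval` call shown above. It assumes the transform is a single JavaScript expression that refers to `res`.

```typescript
// Hypothetical helper: turns a transform string into a callable function.
// Assumption: the string is one JavaScript expression that uses `res`.
function compileTransform(transform: string): (res: unknown) => unknown {
  // Like eval, new Function executes arbitrary code,
  // so only run templates you trust.
  return new Function("res", `return (${transform});`) as (res: unknown) => unknown;
}

const trim = compileTransform("res.trim()");
console.log(trim("  /pulsar-inc/scrapito  ")); // prints /pulsar-inc/scrapito
```

In a template, the equivalent setting would be `transform: res.trim()` on a pipeline.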

Basics

The example below lists all repository URLs from the Explore page on GitHub.

pipelines:
  - name: list_repos
    url: https://github.com/explore
    selector: article h1 a.text-bold
    attribute: href

To also get a list of trending developers, add a second pipeline:

pipelines:
  - name: list_repos
    url: https://github.com/explore
    selector: article h1 a.text-bold
    attribute: href

  - name: list_devs
    url: https://github.com/trending/developers
    selector: article h1.h3 a
    attribute: href

The output will be similar to this:

{
  "list_repos": [
    // Many repos URLs
  ],
  "list_devs": [
    // Many devs profile URLs
  ]
}

Multi-URLs

You can scrape several pages at once like this:

pipelines:
  - name: list_repos_and_devs
    url:
      - https://github.com/trending
      - https://github.com/trending/developers
    selector: article h1.h3 a
    attribute: href

This returns:

{
  "list_repos_and_devs": [
    // Many repos URLs
    // Many devs profile URLs
  ]
}
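
The merging behaviour shown above can be sketched as follows. This is an illustration under stated assumptions, not Scrapito's code: `collectResults` and the `extract` callback are hypothetical names, and the sketch assumes results are concatenated in the order the URLs are declared.

```typescript
// Hypothetical helper: runs one extraction per URL and merges everything
// into a single flat list, as in the list_repos_and_devs output above.
function collectResults(
  url: string | string[],
  extract: (pageUrl: string) => string[]
): string[] {
  // The url parameter accepts either a single string or a list of strings.
  const urls = Array.isArray(url) ? url : [url];
  const merged: string[] = [];
  for (const pageUrl of urls) {
    merged.push(...extract(pageUrl));
  }
  return merged;
}
```

A real runner would fetch the pages over the network, possibly in parallel; the callback stands in for that step here.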

Use responses as argument

You may want to use earlier results to target something on another site or page. Value identifiers solve this problem.

Here is a simple example:

pipelines:
  - name: list_repos
    url:
      - https://github.com/trending
    selector: article h1.h3 a
    attribute: href
    next:
      - name: last_commit_time
        url: map@list_repos
        selector: .Box relative-time
        attribute: $text

This is similar to:

pipelines:
  - name: list_repos
    url:
      - https://github.com/trending
    selector: article h1.h3 a
    attribute: href

  - name: last_commit_time
    url: map@list_repos
    selector: .Box relative-time
    attribute: $text
    wait: [pipe::list_repos]

The map@list_repos url modifier makes the pipeline behave as if its url parameter were filled with the results of list_repos.

You can also use a number instead of map before the @. For example, 0@list_repos takes only the first result of the list_repos pipeline and uses it as the value.
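
One way to picture how such a url value could be resolved is sketched below. `resolveUrl` and the `Results` type are hypothetical names, not part of Scrapito. The behaviour for a missing pipeline or an out-of-range index is an assumption, since the text above does not specify it.

```typescript
// Results collected so far, keyed by pipeline name (hypothetical shape).
type Results = Record<string, string[]>;

// Hypothetical resolver for the "map@name" and "<index>@name" url modifiers.
function resolveUrl(url: string, results: Results): string[] {
  const match = /^(map|\d+)@(.+)$/.exec(url);
  if (match === null) {
    return [url]; // a plain URL, nothing to resolve
  }
  const selector = match[1];
  const pipelineName = match[2];
  const values = results[pipelineName];
  if (values === undefined) {
    // Assumption: referencing a pipeline without results is an error.
    throw new Error(`No results for pipeline "${pipelineName}"`);
  }
  if (selector === "map") {
    return values; // every result becomes a url
  }
  const value = values[Number(selector)];
  // Assumption: an out-of-range index yields no url at all.
  return value === undefined ? [] : [value];
}
```

With `results = { list_repos: ["/a/b", "/c/d"] }`, `resolveUrl("map@list_repos", results)` returns both entries and `resolveUrl("0@list_repos", results)` returns only `"/a/b"`.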

Handle dependencies

As the previous example shows, wait: [pipe::list_repos] makes the last_commit_time pipeline wait for the list_repos pipeline to finish before it is processed.

A pipeline can have multiple dependencies:

pipelines:
  - name: do_first

  - name: do_first_too

  - name: do_second
    wait: [pipe::do_first]

  - name: do_third
    wait: [pipe::do_second]

  - name: do_last
    wait: [pipe::do_third, pipe::do_first_too]
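
The ordering implied by these wait lists can be sketched as a simple dependency resolution. `PipelineDeps` and `executionOrder` are hypothetical names, and this is not Scrapito's scheduler: the sketch is sequential and only shows which pipelines become runnable in which round.

```typescript
interface PipelineDeps {
  name: string;
  wait?: string[]; // entries look like "pipe::<pipeline name>"
}

// Return pipeline names in an order that respects every wait entry.
function executionOrder(pipelines: PipelineDeps[]): string[] {
  const done = new Set<string>();
  const order: string[] = [];
  let remaining = pipelines.slice();
  while (remaining.length > 0) {
    // A pipeline is ready once everything it waits for has finished.
    const ready = remaining.filter((p) =>
      (p.wait ?? []).every((w) => done.has(w.replace(/^pipe::/, "")))
    );
    if (ready.length === 0) {
      throw new Error("Unresolvable or circular wait list");
    }
    for (const p of ready) {
      done.add(p.name);
      order.push(p.name);
    }
    remaining = remaining.filter((p) => !done.has(p.name));
  }
  return order;
}
```

For the template above, this yields do_first, do_first_too, do_second, do_third, do_last. Pipelines that become ready in the same round (here do_first and do_first_too) do not depend on each other, so a runner could process them in parallel.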

TODO: Talk about

Development

Important notes

If you change the rollup config or package.json, run the npm run refresh command after each change so that you can test the result properly.
- It may save you an hour or so ...

Use-cases

Be able to paste some HTML into a web app, where an editor on the right lets you write the JSON template.
