Vizier Architecture

Oliver edited this page Apr 21, 2021 · 3 revisions

The Vizier system consists of the components shown in the architecture diagram below.

  • API: Vizier uses an API layer to manage notebook state and mediate between the components. The API may be accessed directly (e.g., by scripts) or via Vizier's UI.
  • UI: Vizier relies on an HTML/JS-based frontend for most user interactions.
  • Scheduler: A scheduler is responsible for evaluating dependencies between notebook cells and re-executing cells that are out-of-date (whether because the cell was updated or one of its inputs changed in a new notebook version).
  • Datastore: Structured data (dataframes) and simple unstructured data (blobs) are stored in the Datastore layer. In addition to keeping track of this state, the datastore layer is responsible for managing fine-grained provenance relationships between data elements, and profiling dataset state.
  • Filestore: Vizier uses a file storage layer to manage large unstructured data.
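The scheduler's job of finding out-of-date cells can be illustrated with a minimal sketch. It assumes a simplified model in which each cell declares the artifacts it reads and writes; the `Cell` class and the `stale_cells` function are hypothetical names for illustration, not part of Vizier's code.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """A notebook cell with the artifact names it reads and writes (hypothetical model)."""
    name: str
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)


def stale_cells(cells, updated):
    """Return the names of cells to re-execute, in notebook order.

    `cells` is the notebook from top to bottom; `updated` is the name of the
    cell that was edited. A later cell is stale if it reads an artifact
    written by a cell that is itself stale.
    """
    dirty = set()  # artifacts whose contents may have changed
    stale = []
    for cell in cells:
        if cell.name == updated or cell.reads & dirty:
            stale.append(cell.name)
            dirty |= cell.writes
        else:
            # An up-to-date cell overwrites its outputs with unchanged values.
            dirty -= cell.writes
    return stale


notebook = [
    Cell("load", writes={"raw"}),
    Cell("clean", reads={"raw"}, writes={"tidy"}),
    Cell("notes", writes={"readme"}),
    Cell("plot", reads={"tidy"}, writes={"chart"}),
]
print(stale_cells(notebook, "clean"))  # ['clean', 'plot']
```

Editing `clean` leaves `load` and `notes` untouched, because neither depends on anything `clean` produces. A real scheduler would also have to handle new notebook versions and cells whose dependencies are only known after they run.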

There are two versions of Vizier under development: Scala and Python. Their architectures are related but different:

Vizier-Scala

Vizier-Scala is a pure Scala implementation of the Vizier API.

  • UI: Vizier-Scala uses Web-UI, a React application, as its UI layer. It relies on an underlying REST/HATEOAS API.
  • API Layer: The API layer is implemented directly in Vizier-Scala itself. Notable elements include:
    • The Vizier API object manages the API layer, including spinning up a Jetty server to host it.
    • The api package contains handlers for every API call.
    • The routes file specifies all API routes.
    • The api.servlet package contains the servlets implementing the API.
    • The catalog package implements the API's state model.
  • Scheduler: The Scheduler is implemented directly in Vizier-Scala itself, and in particular in the viztrails package. Notable elements include:
    • The Scheduler object manages workflow state, and the workflow execution lifecycle.
    • The Provenance object contains methods for determining inter-cell dependencies and managing cell states.
  • Datastore: The Datastore is implemented by Mimir as a layer over Apache Spark, supplemented with Mimir's Caveats package. Notable elements of the datastore include:
    • The Catalog maintains a record of all existing dataframes and persists a record of how to reconstruct them.
    • The api.request package provides a fixed API to access Datastore functionality.
  • Filestore: The Filestore is implemented directly in Vizier-Scala itself in the Filestore object.

At present, Vizier-Scala's filestore is limited to the local filesystem. Our goal is to eventually merge the Mimir API into Vizier itself, which will allow the filestore to work over HDFS and S3 as well.
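The idea behind the Catalog described above, persisting a record of how to reconstruct each dataframe instead of only the data itself, can be sketched as follows. This is an illustration under stated assumptions and not the Vizier-Scala or Mimir implementation: the `Catalog` class, the JSON recipe format, and the `literal_rows` loader are all hypothetical.

```python
import json
import os
import tempfile


class Catalog:
    """Maps dataframe names to recipes describing how to rebuild them (hypothetical)."""

    def __init__(self, path):
        self.path = path
        self.recipes = {}
        if os.path.exists(path):
            with open(path) as f:
                self.recipes = json.load(f)

    def register(self, name, recipe):
        """Record a recipe and persist the whole catalog to disk."""
        self.recipes[name] = recipe
        with open(self.path, "w") as f:
            json.dump(self.recipes, f)

    def rebuild(self, name, loaders):
        """Reconstruct a dataframe by dispatching on the recipe's `kind`."""
        recipe = self.recipes[name]
        return loaders[recipe["kind"]](**recipe["args"])


def literal_rows(rows):
    """Hypothetical loader: the recipe carries the rows inline."""
    return [dict(r) for r in rows]


path = os.path.join(tempfile.mkdtemp(), "catalog.json")
Catalog(path).register("people", {"kind": "literal", "args": {"rows": [{"name": "Ada"}]}})

# A fresh Catalog instance reads the persisted recipe and rebuilds the data.
restored = Catalog(path).rebuild("people", {"literal": literal_rows})
print(restored)
```

Storing recipes keeps the catalog small and lets the data be rebuilt after a restart, at the cost of re-running the recipe.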

Mimir

Mimir is a system for tracking caveats and provenance of SQL queries. Any request for declarative access to a dataset from the workflow layer goes through Mimir, which uses Spark for storage and execution of data flows. Mimir also implements lenses, the data curation operations built into Vizier.
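As a rough illustration of what tracking caveats through a computation can mean, the sketch below attaches caveat messages to values and carries them along to derived values. It assumes a deliberately simplified model; the `Annotated` class and the `caveat` and `add` helpers are hypothetical and do not reflect Mimir's actual API or its query-level implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotated:
    """A value plus the caveat messages attached to it (hypothetical model)."""
    value: int
    caveats: frozenset = frozenset()


def caveat(value, message):
    """Wrap a value and flag it with a single caveat message."""
    return Annotated(value, frozenset([message]))


def add(a, b):
    """A derived value inherits the caveats of every input it depends on."""
    return Annotated(a.value + b.value, a.caveats | b.caveats)


price = Annotated(10)
tax = caveat(2, "tax was guessed because the field was missing")
total = add(price, tax)
print(total.value, sorted(total.caveats))
```

Here `total` carries the caveat from `tax` even though `price` had none, so a reader of the result can see that it rests on a guessed input.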

Vizier-Python

Web API-Async is a pure Python implementation of the Vizier API.

  • UI: Web API-Async uses Web-UI, a React application, as its UI layer. It relies on an underlying REST/HATEOAS API.
  • API Layer: The API layer is implemented directly in Web API-Async itself.
  • Scheduler: The Scheduler is implemented directly in Web API-Async itself. Notable elements include:
    • The engine module contains the scheduler's execution logic.
    • The viztrail.module.provenance module contains logic for determining inter-cell dependencies and figuring out which cells to run when.
  • Datastore: Web API-Async has a modular data store implementation defined in the datastore package. Three implementations exist:
    • fs: A simple filesystem-based datastore that uses CSV files for tables. Used mostly for debugging.
    • histore: A datastore based on the histore versioned dataset toolkit.
    • mimir: The Mimir datastore used in Vizier-Scala.
  • Filestore: Web API-Async has a modular file store implementation defined in the filestore package. One implementation exists:
    • fs: A simple filesystem-based file store.
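The modular datastore design described above can be sketched as a shared interface with interchangeable backends. The example assumes a minimal two-method interface and a CSV-backed backend in the spirit of the fs datastore; the `Datastore` and `FileSystemDatastore` classes and their method names are hypothetical and are not taken from the datastore package.

```python
import csv
import os
import tempfile
from abc import ABC, abstractmethod


class Datastore(ABC):
    """Interface that each datastore backend implements (hypothetical)."""

    @abstractmethod
    def save(self, name, columns, rows):
        """Persist a table under the given name."""

    @abstractmethod
    def load(self, name):
        """Return (columns, rows) for a previously saved table."""


class FileSystemDatastore(Datastore):
    """Stores each table as one CSV file inside a base directory."""

    def __init__(self, base_dir):
        self.base_dir = base_dir

    def _path(self, name):
        return os.path.join(self.base_dir, name + ".csv")

    def save(self, name, columns, rows):
        with open(self._path(name), "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(columns)  # header row
            writer.writerows(rows)

    def load(self, name):
        with open(self._path(name), newline="") as f:
            reader = csv.reader(f)
            columns = next(reader)
            return columns, [row for row in reader]


store = FileSystemDatastore(tempfile.mkdtemp())
store.save("people", ["name", "age"], [["Ada", 36], ["Alan", 41]])
columns, rows = store.load("people")
print(columns, rows)
```

Because CSV has no type information, values come back as strings (`"36"` rather than `36`), which is one reason a backend like this suits debugging better than production use. Code written against `Datastore` would work unchanged with another backend that implements the same two methods.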