hijohnnylin/sae_vis

This codebase was designed to replicate Anthropic's sparse autoencoder visualisations, which you can see here. It provides two views: a feature-centric view (like the one in the link, i.e. we look at one particular feature and see things like which tokens fire strongest on that feature) and a prompt-centric view (where we look at one particular prompt and see which features fire strongest on that prompt, according to a variety of different metrics).

Install with pip install sae-vis. Link to PyPI page here.
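After installing, one way to confirm that the package is visible to your Python environment is to ask the standard library for the installed distribution's version. This is only a minimal sketch: the helper name `installed_version` is hypothetical and not part of sae-vis, and the distribution name is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints a version string if `pip install sae-vis` succeeded, otherwise None.
print(installed_version("sae-vis"))
```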

Features & Links

Important note - this repo was significantly restructured in March 2024 (we'll remove this message at the end of April). The recent changes include:

  • The ability to view multiple features on the same page (rather than just one feature at a time)
  • D3-backed visualisations (which can do things like add lines to histograms as you hover over tokens)
  • More freedom to customize exactly what the visualisation looks like (we provide full customizability, rather than just being able to change certain parameters)

Here is a link to a Google Drive folder containing 3 files:

  • User Guide, which covers the basics of how to use the repo (the core essentials haven't changed much from the previous version, but there are significantly more features)
  • Dev Guide, which we recommend for anyone who wants to understand how the repo works (and make edits to it)
  • Demo, which is a Colab notebook that gives a few examples

In the demo Colab, we show the two different types of vis which are supported by this library:

  1. Feature-centric vis, where you look at a single feature and see e.g. which sequences in a large dataset this feature fires strongest on.

  2. Prompt-centric vis, where you input a custom prompt and see which features score highest on that prompt, according to a variety of possible metrics.
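To make the difference between the two views concrete, here is a small NumPy sketch that does not use the sae_vis API at all. It assumes a toy activation matrix of shape (tokens, features); the function names, the toy data, and the "max" and "mean" metrics are illustrative assumptions, not the library's actual implementation or its set of metrics.

```python
import numpy as np


def top_tokens_for_feature(acts, tokens, feature_idx, k=3):
    """Feature-centric: the k tokens on which one feature fires strongest."""
    col = acts[:, feature_idx]
    order = np.argsort(-col, kind="stable")[:k]
    return [(tokens[i], float(col[i])) for i in order]


def top_features_for_prompt(acts, k=3, metric="max"):
    """Prompt-centric: the k features scoring highest over one prompt."""
    if metric == "max":
        scores = acts.max(axis=0)   # strongest single-token activation
    elif metric == "mean":
        scores = acts.mean(axis=0)  # average activation over the prompt
    else:
        raise ValueError(f"unknown metric: {metric}")
    order = np.argsort(-scores, kind="stable")[:k]
    return [(int(j), float(scores[j])) for j in order]


# Toy data (hypothetical): 4 tokens of one prompt, 3 features.
tokens = ["The", " cat", " sat", " down"]
acts = np.array([
    [0.0, 0.1, 0.0],
    [2.0, 0.0, 0.3],
    [0.5, 1.5, 0.0],
    [0.0, 0.2, 0.4],
])

print(top_tokens_for_feature(acts, tokens, feature_idx=0, k=2))
print(top_features_for_prompt(acts, k=2, metric="max"))
```

The feature-centric function fixes a column and ranks rows (tokens); the prompt-centric function reduces over rows and ranks columns (features). In the library itself the feature-centric view draws on a large dataset rather than a single prompt.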

Version history (recording started at 0.2.9)

  • 0.2.9 - added table for pairwise feature correlations (not just encoder-B correlations)
  • 0.2.10 - fix some anomalous characters
  • 0.2.11 - update PyPI with longer description
  • 0.2.12 - fix height parameter of config, add videos to PyPI description
  • 0.2.13 - add to dependencies, and fix SAELens section
  • 0.2.14 - fix mistake in dependencies
  • 0.2.15 - refactor to support eventual scatterplot-based feature browser, fix ’ HTML
