# ultratree-results

This shows the results of the ultrametric tree-based, explainable, solar-powered language model

These charts update each day.

## Key Charts

### Levels of carefulness

Training up an ultrametric tree by finding the truly optimal split at each step is computationally prohibitive; we can only subsample candidate splits. Each order of magnitude increase in carefulness costs roughly three orders of magnitude more compute time. "Sense Annotated 1" is an alias for the first training run of Careful1000, which seems like a reasonable compromise: it requires about 100 times as many nodes to achieve the same result as Careful10000, but it trains 1000 times faster.

Careful100 and Careful10 are much, much faster to train, but there's a threshold somewhere between Careful100 and Careful1000 below which too many bad splits get chosen. It's an open question what that threshold is, and why a threshold exists at all. A sketch of the kind of subsampled split search involved follows.
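For illustration only, here is a minimal sketch of subsampled split selection, assuming (and this is purely an assumption about the naming) that CarefulN means scoring N randomly sampled candidate splits per node. The `loss_after_split` scoring function and the surrounding interface are hypothetical, not the repository's actual code.

```python
import random

def best_split(node_data, candidate_splits, carefulness, loss_after_split):
    """Pick a split for one tree node by subsampling candidates.

    Exhaustively scoring every candidate split is prohibitively slow,
    so score only `carefulness` randomly chosen candidates and keep
    the one with the lowest post-split loss.
    """
    sample = random.sample(candidate_splits,
                           min(carefulness, len(candidate_splits)))
    return min(sample, key=lambda split: loss_after_split(node_data, split))
```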

*Figure: Carefulness Levels*

### Does sense annotation work?

The key question this work set out to answer was whether sense annotation (and, more broadly, the whole idea of synergistic semantic and statistical models) was worth exploring.

The "Unannotated Model 1" can be seen as being a baseline statistical model. It's equivalent to a one-hot encoded decision tree. The sense annotated model's learning generalises where the unannotated model is overfitting very early.

*Figure: Sense annotated vs Unannotated*

### Reproducibility and variance in models

Broadly speaking, re-training on the same data yields similar results. Loss on the held-out data goes down roughly linearly with the logarithm of the number of nodes in the model. Note that the models are numbered only by training order (model 1 was trained first); it is just coincidence that model 1 is the best and model 5 the worst.

Even the worst model is doing much better than the unannotated model. If each retrained model had only a 50/50 chance of beating the unannotated model, the probability of all five doing so would be 1/32, which is equivalent to a p-value of about 0.03.
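Spelled out as a sign test, under the null hypothesis that a retrained model is equally likely to land above or below the unannotated model:

$$
P(\text{all 5 models beat the baseline}) = \left(\frac{1}{2}\right)^{5} = \frac{1}{32} \approx 0.031
$$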

*Figure: Reproducibility of model training*

### Ensembling

Ensembling works. The ensemble of five "Careful 1000" models gets results that don't look all that different from an extrapolation of the best of them.
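A minimal sketch of what ensembling can mean here, assuming each member model exposes a predicted probability distribution over the vocabulary (the `predict_proba` interface is hypothetical, not the actual repository API): average the members' distributions, then score the averaged distribution.

```python
import numpy as np

def ensemble_loss(models, contexts, targets):
    """Cross-entropy loss of an ensemble that averages each member's
    predicted next-token distribution.

    `models` is a list of hypothetical tree-model objects exposing
    predict_proba(context) -> probability vector over the vocabulary.
    """
    total = 0.0
    for context, target in zip(contexts, targets):
        # The mean of several probability distributions is itself
        # a valid probability distribution.
        probs = np.mean([m.predict_proba(context) for m in models], axis=0)
        total += -np.log(probs[target])
    return total / len(targets)
```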

*Figure: Total Loss vs Model Size for Sense Annotated Models Including Ensembling*

### Baseline Comparison

Comparison with a neural baseline shows that the best-trained ultrametric trees need a few orders of magnitude more nodes than a neural network needs trainable parameters. But different ultrametric training regimes already differ from each other by several orders of magnitude, so it's not hard to believe that a better training regime could close this gap.

Stranger still, sense annotation makes barely any difference to the neural network models.

*Figure: Neural Network Results*

### Noun loss

Rather than only looking at the total loss over all parts of speech, it is worth examining nouns specifically: we would expect nouns to benefit most from sense annotation, since it organises them into a hierarchy.

But the data shows the exact opposite: as we train, the loss on nouns increases, which means that the loss on all other parts of speech must be dropping even more rapidly.

*Figure: Noun Loss vs Model Size*

We do see that the ultratree models soundly outperform the neural network models on nouns, though. The neural networks behave as one would expect: larger models show more generalised learning.

*Figure: Noun Loss vs Neural Networks*

Theory: the ultrametric models mostly predict nouns, because nouns are the most common part of speech in the corpus, and the trees can group parts of speech together into an aggregate. The neural network mostly predicts punctuation, since it has no way of aggregating parts of speech without internalising rules of grammar. The "." character is the most common "word" in the corpus, so all else being equal, it will get predicted more often.

### Context usage

We can see which contexts get used for node splitting. (This is not the same as asking which nodes get used the most often in inference.)

*Figure: Histogram of context usage*

## Everything Else

### Total Loss

- Total Loss vs Model Size
- Total Loss vs Model Size for the Careful 10000 model

### Noun Loss

- Noun Loss vs Model Size for Sense Annotated Models Including Ensembling
- Noun Loss vs Model Size for the Careful 10000 model

### Time Views

- Total Loss vs Time
- Noun Loss vs Time
- Model Node Count vs Time

### Model Complexity

- Average Depth vs Time
- Average In-Region Hits vs Time

### Context Usage

- Sense Annotated
- Unannotated

## How to reproduce these results

### Download the TinyStories data set, and sense-annotate some of it

Clone `github.com:solresol/wordnetify-tinystories.git`

Follow the instructions in the README.md there.

I stored the sense-annotated training data in `/tinystories/wordnetify-tinystories/TinyStories.sqlite` and the sense-annotated validation data in `/tinystories/wordnetify/w2.sqlite`.

### Make an ultrametric tree model

Clone `github.com:solresol/ultrametric-trees` and follow the instructions in its README.md, including running `cronscript.sh` to export results.

I stored the prepared data (and did the training) in `/ultratree/language-model/tiny.sqlite` and the validation data in `/ultratree/language-model/validation.sqlite`.

### Make a baseline comparison

Clone `github.com:solresol/ultratree-neural-baseline` and follow the instructions in the README.md file there.
