There are many Apps and Services that use Strings as IDs instead of Integers.
While the Google URL Shortener https://goo.gl has recently been discontinued, you are likely to continue seeing its short links around the web for the foreseeable future, and it still serves as an insightful case study.
The https://goo.gl service is a basic URL shortener that allows people (registered Google users) to paste links into an input field and when the "SHORTEN URL" (shouty) button is clicked/tapped, a much shorter URL is created:
When Google announced they were shutting down the service (for new links), I downloaded the CSV of the links I created for one of my accounts:
Notice how the IDs are all 6 alphanumeric characters? The "character set" is:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
I've never seen a goo.gl short URL with a special character.
But since it's a closed-source product, there's no way of knowing for sure.
With a 63-character set and an ID length of 6,
there are 63^6 = 62,523,502,209 (roughly 62.5 billion) possible IDs.
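The arithmetic is simply the character set size raised to the power of the ID length. A minimal sketch, assuming the 63-character set shown above really is the full goo.gl alphabet (the function name id_space is made up for this example):

```python
# Character set observed in the goo.gl short links above (63 characters).
# Assumption: this is the complete alphabet; the service is closed-source.
GOO_GL_CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"


def id_space(charset: str, length: int) -> int:
    """Number of distinct IDs of exactly `length` characters drawn from `charset`."""
    unique_chars = len(set(charset))  # ignore accidental duplicate characters
    return unique_chars ** length


if __name__ == "__main__":
    total = id_space(GOO_GL_CHARSET, 6)
    print(f"{len(GOO_GL_CHARSET)} characters, length 6: {total:,} IDs")
```

Python integers are arbitrary precision, so the same function keeps working for much longer IDs.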
Google was unlikely to "run out" of IDs,
so they must have decided to shut down the service
because it was being used for "SPAM" and "PHISHING" ...
e.g: https://lifehacker.com/5708311/new-virus-watch-out-for-googl-links-on-twitter
People with unethical motives are the reason why we can't have "nice things"!
If you have a simple Blog or Site with a few thousand links, you can easily use short URLs with 3 characters: 63^3 = 250,047 (about 250K) IDs is enough for even a popular blog! And 63^4 = 15,752,961 (about 15.7 million) IDs is plenty for the first few years of a link-sharing service!
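Turning the question around, you can ask how short an ID can be for a given number of links. A rough sketch, using a hypothetical helper called min_id_length and assuming every ID has the same fixed length:

```python
def min_id_length(needed_ids: int, charset_size: int) -> int:
    """Smallest fixed ID length whose ID space holds at least `needed_ids` IDs."""
    if needed_ids < 1 or charset_size < 2:
        raise ValueError("need at least 1 ID and a charset of 2+ characters")
    length, capacity = 1, charset_size
    # Integer multiplication avoids the rounding surprises of math.log.
    while capacity < needed_ids:
        length += 1
        capacity *= charset_size
    return length


if __name__ == "__main__":
    for links in (5_000, 250_000, 10_000_000):
        print(f"{links:>10,} links -> {min_id_length(links, 63)} characters")
```

In practice you would probably want some headroom beyond the bare minimum, especially if IDs are assigned randomly rather than sequentially.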
We will be integrating this knowledge into our App.
From the start Instagram got several things "right" in their iOS App, their backend App/API design and their infrastructure choices. We will be producing a separate "case study" on Instagram in the near future; meanwhile let's focus on one thing: the post IDs.
The first image posted on Instagram (16th July 2010) was by Kevin Systrom. It features a dog in Mexico near a taco stand, with a guest appearance by his girlfriend’s foot wearing a flip-flop: https://www.instagram.com/p/C/
In many ways this image is representative of the social network as a whole: it uses the "X-PRO2" filter and is totally insignificant to most of the 99,704 people who "liked" it, unless these almost 100K people are general "dog-lovers" who "like" all/any dogs ... 💭
Notice how the URL of this image is /p/C/?
The /p/ part refers to the "posts controller" in the (Django-based) Web App.
The C is the ID of the post. Yes, a String is used for the "ID"!
Mike Krieger chose to use a String for the Post IDs (rather than an auto-incrementing Integer) for three simple reasons:
- Strings can be shorter than Ints because the character set is larger. If the character set is just the numeric digits 0123456789, the number of potential IDs (or "posts") is limited by the length of the ID: there are only 9,999 potential IDs if the ID length is 4 characters (10^4 = 10k, minus the 0000 ID, which would never be used in an auto-incrementing database that starts at 1).
- Strings obscure how many posts have been made on the network (whereas an Int would make it immediately obvious how popular the network was!)
- Strings make it more difficult to guess the ID of the next post and "scrape" the site's content. This is also good for privacy because nobody can guess a private Post's ID.
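To illustrate the first point, here is a small sketch that writes the same number using two different "digit" sets. The 63-character ALPHABET and the encode/decode helpers are assumptions for illustration only, not Instagram's actual scheme:

```python
DIGITS = "0123456789"
# Hypothetical 63-character alphabet, chosen only for illustration.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"


def encode(number: int, alphabet: str) -> str:
    """Write a non-negative integer using the characters of `alphabet` as digits."""
    if number < 0:
        raise ValueError("number must be non-negative")
    base = len(alphabet)
    if number == 0:
        return alphabet[0]
    chars = []
    while number > 0:
        number, remainder = divmod(number, base)
        chars.append(alphabet[remainder])
    return "".join(reversed(chars))


def decode(text: str, alphabet: str) -> int:
    """Inverse of encode: turn the string back into the integer."""
    base = len(alphabet)
    number = 0
    for char in text:
        number = number * base + alphabet.index(char)
    return number


if __name__ == "__main__":
    for n in (9_999, 1_000_000, 1_000_000_000):
        print(n, "->", encode(n, DIGITS), "vs", encode(n, ALPHABET))
```

Note that simply re-encoding a counter like this only makes IDs shorter. The IDs are still sequential, so hiding volume or preventing guessing (points two and three) would need something extra, such as random or shuffled IDs.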
Sure enough, the second post on Instagram is: https://www.instagram.com/p/D/ with 24,402 likes ... seems legit. Everyone is entitled to their own "taste" in what content to "like".
If we keep going through the alphabet we soon discover that ID F does not exist:
https://www.instagram.com/p/F/
This could either be because it was never assigned (the system skipped this ID) or because it was deleted.
By December of 2010 Instagram was claiming to have 1 Million Users: https://www.instagram.com/p/pLY-/
And while we have no reason to doubt their claim, there is an obvious incentive for any venture-backed startup to overstate metrics in order to fuel adoption and secure more funding.
It's certainly interesting that a photo of a random Taco stand gets 24k "likes" but the milestone announcement that there are now 1 Million Users on the app gets only 390; you would think people in the early insta-community would have been more excited about this ...? Anyway, back to the IDs!
At the point where they passed 1 Million Users,
the post IDs were 4 characters in length: pLY-
This means there was an "address space" big enough for almost 19M photos
(see below), an average of 19 posts per user.
This pLY- post ID gives us quite a lot more information.
pLY- indicates that the IDs use both UPPERCASE and lowercase letters of the alphabet
(ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz)
and URL-safe characters; in this case a hyphen.
A more recent post on Instagram has an ID of BqHTJ9WHptI (11 characters):
https://www.instagram.com/p/BqHTJ9WHptI
The BqHTJ9WHptI post ID tells us that they are also using numeric characters (0123456789).
So we know they are using a URL-safe human-readable character set which almost matches Base64:
https://en.wikipedia.org/wiki/Base64#Base64_table
The key distinction between the Instagram ID charset and Base64
is that Base64 allows the forward slash (/) and plus (+) characters,
which are both reserved characters in URLs.
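The difference is easy to see with Python's standard library, which ships both the standard Base64 alphabet and a URL-safe variant that swaps + and / for - and _. This is only a sketch of why those two characters are awkward in URLs; it is not a claim that Instagram encodes its IDs this way:

```python
import base64
from urllib.parse import quote

# Two bytes picked so the encoding hits alphabet positions 62 and 63,
# the only two positions where the alphabets differ.
raw = bytes([0xFB, 0xFF])

# Padding ("=") is stripped to keep the comparison focused on the alphabet.
standard = base64.b64encode(raw).decode("ascii").rstrip("=")
url_safe = base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")

if __name__ == "__main__":
    print("standard:", standard, "-> in a URL:", quote(standard, safe=""))
    print("url-safe:", url_safe, "-> in a URL:", quote(url_safe, safe=""))
```

The standard form has to be percent-encoded before it can sit in a URL path, while the URL-safe form passes through untouched.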
This makes us think that Instagram's IDs are more along the lines of
the "unreserved" characters in RFC 3986:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz-~.
66 characters.
With a 66-character set and a 4-character ID length there are 66^4 = 18,974,736 IDs.
With an ID length of 11
(as is the case in the most recent insta posts),
the potential number of IDs is 66^11 = 103,510,234,140,112,521,216 ...
103 Quintillion.
Enough for each of the Earth's 7.5 Billion people
to post 14 Billion images.
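These figures can be reproduced in a few lines. The sketch below builds the 66-character RFC 3986 unreserved set, checks that the two example post IDs fit inside it, and redoes the arithmetic. It assumes Instagram really does use this set, which is a guess based on the IDs we have seen; uses_only is a made-up helper:

```python
import string

# RFC 3986 "unreserved" characters: letters, digits, hyphen, period, underscore, tilde.
UNRESERVED = string.ascii_letters + string.digits + "-._~"


def uses_only(candidate: str, charset: str) -> bool:
    """True if every character of `candidate` appears in `charset`."""
    return all(char in charset for char in candidate)


WORLD_POPULATION = 7_500_000_000   # figure quoted in the text
USERS_AT_MILESTONE = 1_000_000     # the "1 Million Users" announcement

ids_4 = len(UNRESERVED) ** 4
ids_11 = len(UNRESERVED) ** 11
posts_per_user = ids_4 / USERS_AT_MILESTONE
ids_per_person = ids_11 // WORLD_POPULATION

if __name__ == "__main__":
    print("charset size:", len(UNRESERVED))
    print("pLY- fits:", uses_only("pLY-", UNRESERVED))
    print("BqHTJ9WHptI fits:", uses_only("BqHTJ9WHptI", UNRESERVED))
    print(f"4-char IDs:  {ids_4:,} (about {posts_per_user:.0f} per early user)")
    print(f"11-char IDs: {ids_11:,} (about {ids_per_person:,} per person)")
```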
It appears that Instagram have raised their ID length in order to
achieve objectives 2 and 3 (described above):
they don't want anyone to know how much
activity there really is on the network.
Instagram are using a distributed system for creating their post IDs.
YouTube is another example of this. Let's look at a few of the URLs for videos hosted on the site.
https://www.youtube.com/watch?v=fYyDQBG_tYc
https://www.youtube.com/watch?v=jNQXAC9IVRw
https://www.youtube.com/watch?v=g5eGKw4TWbU
Notice all the IDs are 11 alphanumeric characters. The character set appears to be the same as the one used by the Google URL shortener:
0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
YouTube is a site with 1.9 billion users. There is no way of knowing for sure, but it is estimated that there are somewhere between 5 and 7 billion videos currently on the site, and that number is continuously growing as over 300 hours of video are uploaded to the site every minute.
This number of videos is almost hard to wrap your head around and you might think that they would have to change their IDs at some point to reflect the number of videos on the site. Well, let's have a look at that.
With a character set of 63 and an ID length of 11 there are 63^11 = 62,050,608,388,552,823,487 possible IDs: 62 Quintillion. To try and give this number some context, there are more possible IDs than the estimated number of grains of sand on Earth. Close to 9 times more!!!
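Counts this large are beyond what a 64-bit floating-point number can hold exactly (integers above 2^53 may be rounded), so calculators or languages whose default numbers are floats can show slightly-off trailing digits. A quick sketch of the difference, using Python's arbitrary-precision integers:

```python
exact = 63 ** 11                # Python ints never lose precision
via_float = int(float(exact))   # the nearest value a 64-bit float can store

if __name__ == "__main__":
    print(f"exact:      {exact:,}")
    print(f"via float:  {via_float:,}")
    print(f"difference: {abs(exact - via_float):,}")
```

The leading digits agree, so "62 quintillion" is safe either way; only the last few digits depend on how the number was computed.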
We can safely say that YouTube will not be running out of video IDs anytime soon.
YouTube, unlike Instagram, has used this system since the first upload to the site. In fact one of the 3 links above is the first ever YouTube video and the other 2 are from late 2018. As you can see, there is no way to tell the difference without actually clicking the link.
This shows that YouTube was planning for scale from day one. It's also why there is no way of telling just how many videos are on YouTube.