System Design - Design TinyURL

Functional requirements

  1. Client send a long URL, return a short URL to client;
  2. Client send a short URL, return a long URL if it exist and redirect to the long URL, else return error code;
  3. Client send a long URL, and a customized short URL, return short URL to client;
  4. Client send a long URL, and set expiration time of the tiny URL;

Non-Functional requirements

Performance – minimum latency Scalability Stability Availability - high

Numbers Estimation

Number of writes per month = number of new long URLs = 500 million Number of reads per month = number of short URLs = 100*500 million

QPS

New URLs per second: \(500 * 1000000 / (30 * 24 * 3600) =~ 200 URL/s\) URL redirection: \(200 * 100 = 20000 URL/s\)

Assume each URL is about 500 Bytes.

Storage Requirement

\[500 M * 500 B * 12 = 3 TB / year * 5 years = 15 TB\]

Bandwidth

For writes:

\[200 URL/s * 500 B = 100 KB/s\]

For reads:

\[20000 URL/s * 500 B = 10 MB/s\]

Cache Memory Requirement

Assume 20% of the daily URLs need to be cached:

\(10 MB/s *24 * 3600 = 900 GB/day data flow\) \(20% * 900 GB = 180 GB\)

System API

String long2ShortURL(String longURL) – client send a long URL and get a return of short URL String short2LongURL(String shortURL) – client send a shortURL and get a return of long URL

Length of the short URL – how much short it should be?

One year new URL = 500 M *12 = 6 B new URLs. Letters could be: 0 ~ 9: 10 a ~ z: 26 A ~ Z: 26 Totally 62 letters. With 6 letter long, we can get 60^6 URLs = 56 B, more than enough for one year of new URLs. Pick 6 letters as the length of short URLs.

How to convert long URL to short URL?

Hash code

Take M5 or SHA hash of the long string and take the previous 6 letters form the hash code. Disadvantage: still might take repeated short URL with different long URL. Could be mitigated by add a sequence number at the end of the hash code.

Key generation service

A backend process that generates 6 letters randomly beforehand. Whenever a new long URL request is coming in, assign one to the new long URL. Two tables need to be used, one is for used shortURL, the other is for unused shortURL. When a new longURL is coming, go to the unused table, and get one for the longURL, and write the pair to the used table, and then return to the client.

Unique Id generation

Maintain a unique Id in the system, when a new long URL coming in, increment the id and convert the id to a short URL, which is basically a 62 based number if we use the letters mentioned above. Each of the application server could maintain a ID generator, zookeeper may be used to coordinate the Id generation Disadvantage: if distributed in different computers, id is difficult to maintain unique.

Database design

Schema:

id shortURL longURL creat_time expiration_time other information

SQL or NoSQL

Every item is independent, no relation between them. Billions of data need to be stored. NoSQL is a better option for scalability.

Availability

Each serve could be replicated using master-slave structure. Read could goes to various slave computer, while write goes to master only. When read longURL from each of the slave computer, it should check expiration time of the URL, when it expires, return error code to user. Eventual consistency will achieved.

Partition/Sharding

Partition storage by range based. Could result in unbalanced server load.

Consistent Hash based partition: take hash of each longURL and distribute it to different partitions. Using consistent hash.

Load balance

Load balance could be placed between client to application server, application server to database.

Cache

Cache stores key/value pairs of longURL and shortURL. Application goes to cache for shortURL, if not there, application goes to database to get the shortURL, backend calculate the shortURL and update the database. Database then send back to client and also update the cache. LRU eviction rule should be used because the URL most likely used in a short period of time.

DB cleanup

When a URL get expired, it should be removed from the database. If actively searching for expired URL, it will put a lot of pressure to the database. We can leave the URL there until a client acquire that URL and then delete it from there. We can also sweep the DB during night time when the load is less, and remove any expired item.

Written on May 20, 2021