System Design - Design TinyURL
Functional requirements
- Client sends a long URL; return a short URL to the client;
- Client sends a short URL; if it exists, return the long URL and redirect to it, else return an error code;
- Client sends a long URL and a customized short URL; return the short URL to the client;
- Client sends a long URL and sets an expiration time for the tiny URL;
Non-Functional requirements
- Performance: minimum latency
- Scalability
- Stability
- Availability: high
Numbers Estimation
Number of writes per month = number of new long URLs = 500 million.
Number of reads per month = number of redirections = 100 * 500 million (assuming a 100:1 read/write ratio).
QPS
New URLs per second: \(500\,M / (30 \times 24 \times 3600) \approx 200\) URL/s
URL redirections per second: \(200 \times 100 = 20000\) URL/s
Assume each URL record is about 500 bytes.
Storage Requirement
\[500\,M \times 500\,B \times 12 = 3\,TB/year\]
\[3\,TB/year \times 5\,years = 15\,TB\]
Bandwidth
For writes:
\[200\ URL/s \times 500\,B = 100\,KB/s\]
For reads:
\[20000\ URL/s \times 500\,B = 10\,MB/s\]
Cache Memory Requirement
Assume 20% of the daily URLs need to be cached:
\(10\,MB/s \times 24 \times 3600 \approx 900\,GB\) of data flow per day
\(20\% \times 900\,GB = 180\,GB\)
System API
String long2ShortURL(String longURL) – client sends a long URL and gets a short URL in return
String short2LongURL(String shortURL) – client sends a short URL and gets the long URL in return
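A minimal Java transcription of these two calls as a service interface; the interface name and error behavior are illustrative assumptions, not part of the original design:

```java
// Sketch of the system API; only the two method signatures come from the design above.
public interface TinyUrlService {
    // Client sends a long URL and gets a short URL in return.
    String long2ShortURL(String longURL);

    // Client sends a short URL and gets the long URL in return
    // (how "not found" is signaled, e.g. null or an exception, is left open here).
    String short2LongURL(String shortURL);
}
```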
Length of the short URL – how short should it be?
New URLs in one year = 500 M * 12 = 6 B. The characters could be:
- 0 ~ 9: 10
- a ~ z: 26
- A ~ Z: 26
62 characters in total. With a length of 6, we get \(62^6 \approx 56\,B\) URLs, more than enough for one year of new URLs. Pick 6 characters as the length of short URLs.
How to convert long URL to short URL?
Hash code
Take the MD5 or SHA hash of the long URL and take the first 6 characters of the hash. Disadvantage: different long URLs may still collide on the same short URL. This can be mitigated by appending a sequence number to the hash.
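A rough sketch of this approach in Java, assuming we map the first 6 bytes of the MD5 digest onto the 62-character alphabet (class and method names are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashShortener {
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Hash the long URL with MD5 and map the first 6 bytes to the alphabet.
    // Different long URLs can still collide, as noted above.
    public static String shorten(String longUrl) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(longUrl.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(6);
            for (int i = 0; i < 6; i++) {
                sb.append(ALPHABET.charAt((digest[i] & 0xFF) % 62));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```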
Key generation service
A backend process generates 6-character keys randomly beforehand. Whenever a new long-URL request comes in, assign one of the pre-generated keys to it. Two tables are needed: one for used short URLs and one for unused short URLs. When a new long URL comes in, take a key from the unused table, write the (shortURL, longURL) pair to the used table, and return the short URL to the client.
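A minimal in-memory sketch of the two-table idea; in the real design the pools would be database tables, with each key moved between them transactionally (all names here are illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class KeyGenerationService {
    // "Unused" and "used" tables, modeled as in-memory structures for the sketch.
    private final ConcurrentLinkedQueue<String> unused = new ConcurrentLinkedQueue<>();
    private final Map<String, String> used = new ConcurrentHashMap<>();

    public KeyGenerationService(Iterable<String> pregeneratedKeys) {
        pregeneratedKeys.forEach(unused::add);
    }

    // Take a pre-generated key from the unused pool, record the pair in the
    // used table, and return the short URL for the client.
    public String assign(String longUrl) {
        String key = unused.poll();
        if (key == null) {
            throw new IllegalStateException("key pool exhausted");
        }
        used.put(key, longUrl);
        return key;
    }
}
```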
Unique Id generation
Maintain a unique ID in the system: when a new long URL comes in, increment the ID and convert it to a short URL, which is basically the ID written as a base-62 number using the 62 characters above. Each application server could maintain its own ID generator; ZooKeeper may be used to coordinate ID generation. Disadvantage: when generation is distributed across machines, keeping the ID unique is difficult.
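The ID-to-short-URL conversion is just base-62 encoding over that alphabet; a small sketch (class and method names are illustrative):

```java
public class Base62 {
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Encode a numeric ID as a base-62 string, e.g. encode(125) -> "21"
    // because 125 = 2 * 62 + 1.
    public static String encode(long id) {
        if (id == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62)));
            id /= 62;
        }
        return sb.reverse().toString();
    }
}
```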
Database design
Schema:
id | shortURL | longURL | create_time | expiration_time | other information
SQL or NoSQL
Every item is independent; there are no relations between them, and billions of rows need to be stored. NoSQL is the better option for scalability.
Availability
Each server could be replicated using a master-slave structure. Reads can go to any of the slave machines, while writes go to the master only. When reading a long URL from a slave, check the expiration time of the URL; if it has expired, return an error code to the user. Eventual consistency will be achieved.
Partition/Sharding
Range-based partitioning: partition storage by key range. This could result in unbalanced server load.
Hash-based partitioning: take a hash of each longURL and distribute the records across partitions. Use consistent hashing so that adding or removing a node moves only a small fraction of the keys.
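A compact sketch of a consistent-hash ring in Java, assuming MD5 for the ring hash and virtual nodes for balance (all names here are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    private static final int VIRTUAL_NODES = 100; // replicas per server for balance
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Place a server on the ring at multiple virtual positions.
    public void addServer(String server) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    // A key is owned by the first server clockwise from its hash position.
    public String serverFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF); // fold first 8 digest bytes into a long
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```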
Load balance
Load balancers could be placed between the clients and the application servers, and between the application servers and the database.
Cache
The cache stores shortURL/longURL key-value pairs. The application goes to the cache first; if the entry is not there, the application goes to the database (on the write path, the backend computes the short URL and updates the database if it is not there yet), sends the result back to the client, and also updates the cache. An LRU eviction policy should be used, because a URL is most likely to be accessed repeatedly within a short period of time.
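LRU eviction can be sketched in a few lines with Java's LinkedHashMap in access-order mode; a real deployment would more likely use Memcached or Redis:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache of shortURL -> longURL pairs; the least-recently-accessed
// entry is evicted once the capacity is exceeded.
public class UrlLruCache extends LinkedHashMap<String, String> {
    private final int capacity;

    public UrlLruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true yields LRU ordering
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > capacity;
    }
}
```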
DB cleanup
When a URL expires, it should be removed from the database. Actively searching for expired URLs would put a lot of pressure on the database. Instead, we can leave a URL in place until a client requests it, and delete it at that point (lazy cleanup). We can also sweep the DB at night, when the load is lower, and remove any expired items.
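A sketch of the lazy-cleanup lookup, assuming an in-memory stand-in for the database table (UrlStore and UrlRecord are illustrative names):

```java
import java.time.Instant;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

public class UrlStore {
    public record UrlRecord(String longUrl, Instant expiresAt) {}

    private final ConcurrentHashMap<String, UrlRecord> table = new ConcurrentHashMap<>();

    public void put(String shortUrl, String longUrl, Instant expiresAt) {
        table.put(shortUrl, new UrlRecord(longUrl, expiresAt));
    }

    // Expired entries are deleted only when a client actually asks for them;
    // a nightly sweep over the table would remove the rest.
    public Optional<String> lookup(String shortUrl) {
        UrlRecord rec = table.get(shortUrl);
        if (rec == null) return Optional.empty();
        if (rec.expiresAt() != null && rec.expiresAt().isBefore(Instant.now())) {
            table.remove(shortUrl, rec); // lazy delete on access
            return Optional.empty();     // caller returns an error code
        }
        return Optional.of(rec.longUrl());
    }
}
```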