This article will delve into the step-by-step process of designing a file uploading and sharing service, exploring the key considerations, architectural decisions, and best practices involved in creating a reliable and user-friendly platform. By following this comprehensive guide, developers and system architects can navigate the complexities of building a file sharing service that meets the evolving needs of today's digital landscape.
Requirements
Functional
Users should be able to upload photos and files.
Users should be able to create and delete directories on the drive.
Users should be able to download files.
Users should be able to share uploaded files.
The drive should synchronise data across all of a user's devices.
Users should be able to upload files/photos/videos even when the internet is unavailable. As and when connectivity is restored, the offline files are synced to online storage.
Non-functional
Availability — the percentage of time the system is up and able to serve user requests.
Durability — the system should ensure that files uploaded by users are stored permanently on the drive without any data loss.
Reliability — how consistently the system produces the expected output for the same input.
Scalability — the system should be able to handle increasing traffic.
ACID properties — Atomicity, Consistency, Isolation and Durability. All file operations should follow these properties.
Atomicity — any operation executed on a file should either complete fully or not happen at all; it should never be left partially complete. For example, if a user uploads a file, the end state is either a 100% uploaded file or no file at all.
Consistency — the data should be the same before and after the operation completes, i.e. the file stored on the drive must be identical to the file the user uploaded.
Isolation — two operations running simultaneously should be independent and must not affect each other's data.
Durability — once the system acknowledges an upload, the file is stored permanently and must not be lost.
Back of the envelope calculation
Storage
Average File Size: assume an average file size of 10 MB.
Number of Users: assume 100,000 users.
Retention Period: assume the platform retains user files for 1 year.
Redundancy and Overhead: allocate an additional 20% for redundancy, backups, and system overhead.
Base Storage Required (assuming, for simplicity, one average-sized file per user):
Average File Size * Number of Users = 10 MB * 100,000 = 1,000,000 MB (or 1 TB)
Total Storage Required for All Users (with redundancy):
Base Storage * Redundancy Factor = 1 TB * 1.20 = 1.2 TB
Traffic Estimate:
Let's assume each user, on average, uploads and downloads a total of 10GB of data per month.
Total monthly traffic = (Data transfer per user) * (Number of subscribers)
Total monthly traffic = 10 GB/user * 100,000 users = 1,000,000 GB (≈ 1 PB)
Bandwidth Cost Estimate:
Assuming a cost of $0.05 per GB transferred, the monthly bandwidth cost would be:
Monthly bandwidth cost = Total monthly traffic * Cost per GB
Monthly bandwidth cost = 1,000,000 GB * $0.05/GB = $50,000
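To make these estimates easy to tweak, here is a minimal Python sketch that reproduces the numbers above. The constants simply mirror the assumptions in this section; adjust them for your own workload.

```python
# Back-of-the-envelope estimates for storage, traffic, and bandwidth cost.
AVG_FILE_SIZE_MB = 10          # average file size
NUM_USERS = 100_000            # number of users
REDUNDANCY_FACTOR = 1.20       # +20% for redundancy, backups, overhead
DATA_PER_USER_GB_MONTH = 10    # upload + download per user per month
COST_PER_GB = 0.05             # $ per GB transferred

base_storage_tb = AVG_FILE_SIZE_MB * NUM_USERS / 1_000_000   # MB -> TB
total_storage_tb = base_storage_tb * REDUNDANCY_FACTOR

monthly_traffic_gb = DATA_PER_USER_GB_MONTH * NUM_USERS
monthly_cost = monthly_traffic_gb * COST_PER_GB

print(f"Base storage:      {base_storage_tb:.1f} TB")
print(f"With redundancy:   {total_storage_tb:.1f} TB")
print(f"Monthly traffic:   {monthly_traffic_gb:,} GB (~{monthly_traffic_gb / 1_000_000:.0f} PB)")
print(f"Monthly bandwidth: ${monthly_cost:,.0f}")
```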
Architecture
Black-box components
Frontend Interface (Web/Mobile/Desktop application or SDK)
Backend Server
File Storage (Amazon S3 / Google Cloud Storage / Azure Blob Storage)
Metadata Management (SQL/NoSQL)
White-box Components
Application or SDK
Indexer
It scans and analyses the files and folders stored in a user's account, extracting metadata such as file names, sizes, modification dates, and content types.
The indexer enables efficient searching, browsing, and organising of files and folders within the Dropbox interface or through API queries.
The indexer is notified by the watcher whenever files or folders are modified, and it gets help from the chunker if required.
The indexer updates the internal database with information about the chunks of the modified files. Once the chunks are successfully submitted to the Cloud Storage, the indexer communicates with the Synchronisation Service via the Message Queuing Service to update the Metadata Database with the changes.
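As a rough sketch of the scanning step, the snippet below walks a sync folder and extracts the kind of metadata the indexer would record (the function name and record fields are illustrative assumptions, not an actual Dropbox API):

```python
import mimetypes
from pathlib import Path

def index_folder(root: str) -> list[dict]:
    """Walk the sync folder and extract per-file metadata.

    A real indexer would persist this into the client-side DB and diff it
    against the previous scan to detect changes.
    """
    records = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            stat = path.stat()
            records.append({
                "name": path.name,
                "path": str(path),
                "size_bytes": stat.st_size,
                "modified_at": stat.st_mtime,
                "content_type": mimetypes.guess_type(path.name)[0],
            })
    return records
```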
Chunker
"Chunking" is a technique used for breaking down large data into smaller, more manageable pieces, often used in file transfer protocols. In the context of file upload services like Dropbox, chunked uploads allow large files to be uploaded in smaller pieces (chunks) instead of all at once. This helps with reliability and performance, especially over unreliable network connections.
The chunker splits files into smaller pieces called chunks. To reconstruct a file, the chunks are joined back together in the correct order. A chunking algorithm can detect the parts of a file that have been modified by the user and transfer only those parts to the Cloud Storage, saving cloud storage space, bandwidth usage, and synchronisation time.
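A minimal chunker sketch, assuming fixed-size chunks (production systems often use content-defined chunking so that an insertion does not shift every subsequent chunk). Hashing each chunk lets the client skip uploading chunks the server already has:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB; the chunk size here is an assumption

def chunk_file(path: str):
    """Split a file into fixed-size chunks, yielding (index, sha256, bytes).

    Only chunks whose hash the server does not already know need to be
    uploaded, which enables delta sync and deduplication.
    """
    with open(path, "rb") as f:
        index = 0
        while chunk := f.read(CHUNK_SIZE):
            yield index, hashlib.sha256(chunk).hexdigest(), chunk
            index += 1
```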
Watcher
A component that monitors changes in a system or environment and triggers actions in response to those changes.
In the context of file sharing services like Dropbox or Google Drive, a watcher monitors changes to sync folders in real time and notifies applications or users of those changes.
The watcher notifies the indexer if there are any modifications in the sync folder.
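A minimal watcher sketch using the third-party watchdog package (pip install watchdog); in a real client the event handler would enqueue work for the indexer rather than print:

```python
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class SyncFolderWatcher(FileSystemEventHandler):
    def on_any_event(self, event):
        if not event.is_directory:
            # Here a real client would notify the indexer of the change.
            print(f"{event.event_type}: {event.src_path}")

observer = Observer()
observer.schedule(SyncFolderWatcher(), path="./sync_folder", recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```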
Client-side DB
A lightweight database suitable for client-side applications, used to store metadata of the files/folders.
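SQLite is a common choice here since it ships with most client runtimes. A sketch of what the client might track per file (the schema is an illustrative assumption):

```python
import sqlite3

conn = sqlite3.connect("client_metadata.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS local_files (
        path        TEXT PRIMARY KEY,
        size_bytes  INTEGER,
        modified_at REAL,
        file_hash   TEXT,
        sync_state  TEXT CHECK (sync_state IN ('pending', 'uploading', 'synced'))
    )
""")
# Record a newly detected file as pending upload.
conn.execute(
    "INSERT OR REPLACE INTO local_files VALUES (?, ?, ?, ?, ?)",
    ("/sync_folder/report.pdf", 1_048_576, 1700000000.0, "ab12cd34", "pending"),
)
conn.commit()
```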
Backend Server
Synchronisation Service
The Synchronisation Service is the component that processes file updates made by a client and applies these changes to other subscribed clients. It also synchronises clients’ local databases with the information stored in the Metadata Database.
Messaging Queue Service
Message Queuing Service supports asynchronous message-based communication between clients and the Synchronisation Service instances.
Two types of queues are used in our Message Queuing Service. The Request Queue is a global queue shared among all clients: clients' requests to update the Metadata Database through the Synchronisation Service are sent to the Request Queue. The Response Queues, one per subscribed client, are responsible for delivering the update messages to each client.
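The snippet below models the two queue types in-process with queue.Queue purely to show the flow; a production system would use a broker such as RabbitMQ, Kafka, or SQS, and the message shapes here are assumptions:

```python
import queue

request_queue = queue.Queue()                 # global, shared by all clients
response_queues = {                           # one per subscribed client
    "client-a": queue.Queue(),
    "client-b": queue.Queue(),
}

# A client publishes a metadata update request.
request_queue.put({"client": "client-a", "op": "update", "file": "notes.txt"})

# The Synchronisation Service consumes it, updates the Metadata Database,
# then fans the change out to every other subscribed client's response queue.
msg = request_queue.get()
for client_id, q in response_queues.items():
    if client_id != msg["client"]:
        q.put({"event": "file_changed", "file": msg["file"]})

print(response_queues["client-b"].get())  # {'event': 'file_changed', 'file': 'notes.txt'}
```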
Cloud Storage (Amazon S3 / Google Cloud Storage / Azure Blob Storage)
Cloud Storage/Block server stores the chunks of the files uploaded by the users. Clients directly interact with the Cloud Storage to send and receive objects using the API provided by the cloud provider.
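One common pattern for this direct client-to-storage interaction is a presigned URL: the backend authorises the upload, and the client PUTs the bytes straight to the bucket. A sketch with boto3 (the bucket and key names are hypothetical, and AWS credentials are required to run it):

```python
import boto3

# The backend issues a short-lived presigned URL so the client can upload
# the chunk directly to S3 without proxying bytes through our servers.
s3 = boto3.client("s3")

upload_url = s3.generate_presigned_url(
    ClientMethod="put_object",
    Params={"Bucket": "drive-chunks", "Key": "user123/ab12cd34.chunk"},
    ExpiresIn=900,  # valid for 15 minutes
)
# The client then uploads with e.g.: requests.put(upload_url, data=chunk_bytes)
```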
Metadata Database
The Metadata Database is responsible for maintaining the versioning and metadata information about files/chunks, users, and workspaces. It can be a relational database such as MySQL, or a NoSQL database; we just need to make sure the required data consistency guarantees are met.
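An illustrative relational schema for this data, shown in SQLite purely for brevity (table and column names are assumptions for the sketch, not a prescribed design):

```python
import sqlite3

conn = sqlite3.connect("metadata.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS users (
        user_id   INTEGER PRIMARY KEY,
        email     TEXT UNIQUE NOT NULL
    );
    CREATE TABLE IF NOT EXISTS files (
        file_id   INTEGER PRIMARY KEY,
        user_id   INTEGER REFERENCES users(user_id),
        path      TEXT NOT NULL,
        version   INTEGER NOT NULL DEFAULT 1
    );
    CREATE TABLE IF NOT EXISTS chunks (
        file_id     INTEGER REFERENCES files(file_id),
        chunk_index INTEGER NOT NULL,
        chunk_hash  TEXT NOT NULL,  -- also serves as the object key in cloud storage
        PRIMARY KEY (file_id, chunk_index)
    );
""")
conn.commit()
```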
High-level component diagram of the system
Conclusion
These are the high-level components that should be considered when designing a file sharing service. This does not cover other aspects of the system such as security, monitoring, etc.