This article will delve into the step-by-step process of designing a file uploading and sharing service, exploring the key considerations, architectural decisions, and best practices involved in creating a reliable and user-friendly platform. By following this comprehensive guide, developers and system architects can navigate the complexities of building a file sharing service that meets the evolving needs of today's digital landscape.
Requirements
Functional
Users should be able to upload photos and files.
Users should be able to create and delete directories on the drive.
Users should be able to download files.
Users should be able to share uploaded files.
The drive should synchronise data across all of a user's devices.
Users should be able to upload files/photos/videos even when the internet is unavailable. As and when connectivity is restored, the offline files are synced to online storage.
Non-functional
Availability — the percentage of time the system is up and able to serve user requests.
Durability — the system should ensure that files uploaded by users are stored permanently on the drive without any data loss.
Reliability — how consistently the system produces the expected output for the same input.
Scalability — the system should be able to handle increasing traffic.
ACID properties — Atomicity, Consistency, Isolation and Durability. All file operations should follow these properties.
Atomicity — any operation executed on a file should either complete fully or not happen at all; it should never be left partially complete. For example, if a user uploads a file, the end state is either a 100% uploaded file or no file at all.
Consistency — the data should be the same before and after the operation completes, i.e. the file stored on the drive must be identical to the file the user uploaded.
Isolation — two operations running simultaneously should be independent and must not affect each other's data.
Durability — once the system acknowledges an upload, the file is stored permanently and must not be lost.
Back of the envelope calculation
Storage
Average File Size: assume an average file size of 10 MB.
Number of Users: assume 100,000 users.
Retention Period: assume the platform retains user files for 1 year.
Redundancy and Overhead: allocate an additional 20% for redundancy, backups, and system overhead.
Base Storage Required (assuming, for simplicity, one average-sized file per user):
Average File Size * Number of Users = 10 MB * 100,000 = 1,000,000 MB (or 1 TB)
Total Storage Required for All Users (with redundancy):
Base Storage * Redundancy Factor = 1 TB * 1.20 = 1.2 TB
Traffic Estimate:
Let's assume each user, on average, uploads and downloads a total of 10GB of data per month.
Total monthly traffic = (Data transfer per user) * (Number of subscribers)
Total monthly traffic = 10 GB/user * 100,000 users = 1,000,000 GB (≈ 1 PB)
Bandwidth Cost Estimate:
Assuming a cost of $0.05 per GB transferred, the monthly bandwidth cost would be:
Monthly bandwidth cost = Total monthly traffic * Cost per GB
Monthly bandwidth cost = 1,000,000 GB * $0.05/GB = $50,000
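To make these estimates easy to tweak, here is a minimal Python sketch that reproduces the numbers above. The constants simply mirror the assumptions in this section; adjust them for your own workload.

```python
# Back-of-the-envelope estimates for storage, traffic, and bandwidth cost.
AVG_FILE_SIZE_MB = 10          # average file size
NUM_USERS = 100_000            # number of users
REDUNDANCY_FACTOR = 1.20       # +20% for redundancy, backups, overhead
DATA_PER_USER_GB_MONTH = 10    # upload + download per user per month
COST_PER_GB = 0.05             # $ per GB transferred

base_storage_tb = AVG_FILE_SIZE_MB * NUM_USERS / 1_000_000   # MB -> TB
total_storage_tb = base_storage_tb * REDUNDANCY_FACTOR

monthly_traffic_gb = DATA_PER_USER_GB_MONTH * NUM_USERS
monthly_cost = monthly_traffic_gb * COST_PER_GB

print(f"Base storage:      {base_storage_tb:.1f} TB")
print(f"With redundancy:   {total_storage_tb:.1f} TB")
print(f"Monthly traffic:   {monthly_traffic_gb:,} GB (~{monthly_traffic_gb / 1_000_000:.0f} PB)")
print(f"Monthly bandwidth: ${monthly_cost:,.0f}")
```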
Architecture
Black-box components
Frontend Interface (Web/Mobile/Desktop application or SDK)
Backend Server
File Storage (Amazon S3 / Google Cloud Storage / Azure Blob Storage)
Metadata Management (SQL/NoSQL)
White-box Components
Application or SDK
Indexer
It scans and analyses the files and folders stored in a user's account, extracting metadata such as file names, sizes, modification dates, and content types.
The indexer enables efficient searching, browsing, and organising of files and folders within the Dropbox interface or through API queries.
The indexer is notified by the watcher whenever files or folders are modified, and it gets help from the chunker if required.
The indexer updates the internal database with information about the chunks of the modified files. Once the chunks are successfully submitted to the Cloud Storage, the indexer communicates with the Synchronisation Service via the Message Queuing Service to update the Metadata Database with the changes.
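As a rough sketch of the scanning step, the snippet below walks a sync folder and extracts the kind of metadata the indexer would record (the function name and record fields are illustrative assumptions, not an actual Dropbox API):

```python
import mimetypes
from pathlib import Path

def index_folder(root: str) -> list[dict]:
    """Walk the sync folder and extract per-file metadata.

    A real indexer would persist this into the client-side DB and diff it
    against the previous scan to detect changes.
    """
    records = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            stat = path.stat()
            records.append({
                "name": path.name,
                "path": str(path),
                "size_bytes": stat.st_size,
                "modified_at": stat.st_mtime,
                "content_type": mimetypes.guess_type(path.name)[0],
            })
    return records
```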
Chunker
"Chunking" is a technique used for breaking down large data into smaller, more manageable pieces, often used in file transfer protocols. In the context of file upload services like Dropbox, chunked uploads allow large files to be uploaded in smaller pieces (chunks) instead of all at once. This helps with reliability and performance, especially over unreliable network connections.
The chunker splits files into smaller pieces called chunks. To reconstruct a file, the chunks are joined back together in the correct order. A chunking algorithm can detect the parts of a file that have been modified by the user and transfer only those parts to the Cloud Storage, saving cloud storage space, bandwidth usage, and synchronisation time.
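A minimal chunker sketch, assuming fixed-size chunks (production systems often use content-defined chunking so that an insertion does not shift every subsequent chunk). Hashing each chunk lets the client skip uploading chunks the server already has:

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB; the chunk size here is an assumption

def chunk_file(path: str):
    """Split a file into fixed-size chunks, yielding (index, sha256, bytes).

    Only chunks whose hash the server does not already know need to be
    uploaded, which enables delta sync and deduplication.
    """
    with open(path, "rb") as f:
        index = 0
        while chunk := f.read(CHUNK_SIZE):
            yield index, hashlib.sha256(chunk).hexdigest(), chunk
            index += 1
```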
Watcher
A component that monitors changes in a system or environment and triggers actions in response to those changes.
In the context of file sharing services like Dropbox or Google Drive, a watcher monitors changes to sync folders in real time and notifies applications or users of those changes.
The watcher notifies the indexer if there are any modifications in the sync folder.
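A minimal watcher sketch using the third-party watchdog package (pip install watchdog); in a real client the event handler would enqueue work for the indexer rather than print:

```python
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class SyncFolderWatcher(FileSystemEventHandler):
    def on_any_event(self, event):
        if not event.is_directory:
            # Here a real client would notify the indexer of the change.
            print(f"{event.event_type}: {event.src_path}")

observer = Observer()
observer.schedule(SyncFolderWatcher(), path="./sync_folder", recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()
```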
Client-side DB
A lightweight database suitable for client-side applications, used to store metadata of the files/folders.
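SQLite is a common choice here since it ships with most client runtimes. A sketch of what the client might track per file (the schema is an illustrative assumption):

```python
import sqlite3

conn = sqlite3.connect("client_metadata.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS local_files (
        path        TEXT PRIMARY KEY,
        size_bytes  INTEGER,
        modified_at REAL,
        file_hash   TEXT,
        sync_state  TEXT CHECK (sync_state IN ('pending', 'uploading', 'synced'))
    )
""")
# Record a newly detected file as pending upload.
conn.execute(
    "INSERT OR REPLACE INTO local_files VALUES (?, ?, ?, ?, ?)",
    ("/sync_folder/report.pdf", 1_048_576, 1700000000.0, "ab12cd34", "pending"),
)
conn.commit()
```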
Backend Server
Synchronisation Service
The Synchronisation Service is the component that processes file updates made by a client and applies these changes to other subscribed clients. It also synchronises clients’ local databases with the information stored in the Metadata Database.
Messaging Queue Service
Message Queuing Service supports asynchronous message-based communication between clients and the Synchronisation Service instances.
Two types of queues are used in our Message Queuing Service. The Request Queue is a global queue shared among all clients: clients' requests to update the Metadata Database through the Synchronisation Service are sent to the Request Queue. The Response Queues, one per subscribed client, are responsible for delivering the update messages to each client.
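The snippet below models the two queue types in-process with queue.Queue purely to show the flow; a production system would use a broker such as RabbitMQ, Kafka, or SQS, and the message shapes here are assumptions:

```python
import queue

request_queue = queue.Queue()                 # global, shared by all clients
response_queues = {                           # one per subscribed client
    "client-a": queue.Queue(),
    "client-b": queue.Queue(),
}

# A client publishes a metadata update request.
request_queue.put({"client": "client-a", "op": "update", "file": "notes.txt"})

# The Synchronisation Service consumes it, updates the Metadata Database,
# then fans the change out to every other subscribed client's response queue.
msg = request_queue.get()
for client_id, q in response_queues.items():
    if client_id != msg["client"]:
        q.put({"event": "file_changed", "file": msg["file"]})

print(response_queues["client-b"].get())  # {'event': 'file_changed', 'file': 'notes.txt'}
```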
Cloud Storage (Amazon S3 / Google Cloud Storage / Azure Blob Storage)
Cloud Storage/Block server stores the chunks of the files uploaded by the users. Clients directly interact with the Cloud Storage to send and receive objects using the API provided by the cloud provider.
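One common pattern for this direct client-to-storage interaction is a presigned URL: the backend authorises the upload, and the client PUTs the bytes straight to the bucket. A sketch with boto3 (the bucket and key names are hypothetical, and AWS credentials are required to run it):

```python
import boto3

# The backend issues a short-lived presigned URL so the client can upload
# the chunk directly to S3 without proxying bytes through our servers.
s3 = boto3.client("s3")

upload_url = s3.generate_presigned_url(
    ClientMethod="put_object",
    Params={"Bucket": "drive-chunks", "Key": "user123/ab12cd34.chunk"},
    ExpiresIn=900,  # valid for 15 minutes
)
# The client then uploads with e.g.: requests.put(upload_url, data=chunk_bytes)
```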
Metadata Database
The Metadata Database is responsible for maintaining the versioning and metadata information about files/chunks, users, and workspaces. It can be a relational database such as MySQL, or a NoSQL database; we just need to make sure the required data consistency guarantees are met.
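An illustrative relational schema for this data, shown in SQLite purely for brevity (table and column names are assumptions for the sketch, not a prescribed design):

```python
import sqlite3

conn = sqlite3.connect("metadata.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS users (
        user_id   INTEGER PRIMARY KEY,
        email     TEXT UNIQUE NOT NULL
    );
    CREATE TABLE IF NOT EXISTS files (
        file_id   INTEGER PRIMARY KEY,
        user_id   INTEGER REFERENCES users(user_id),
        path      TEXT NOT NULL,
        version   INTEGER NOT NULL DEFAULT 1
    );
    CREATE TABLE IF NOT EXISTS chunks (
        file_id     INTEGER REFERENCES files(file_id),
        chunk_index INTEGER NOT NULL,
        chunk_hash  TEXT NOT NULL,  -- also serves as the object key in cloud storage
        PRIMARY KEY (file_id, chunk_index)
    );
""")
conn.commit()
```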
High-level component diagram of the system
Conclusion
These are the high-level components that should be considered when designing a file sharing service. This does not cover other aspects of the system such as security, monitoring, etc.