Have you ever wondered why, despite all these years and an absolutely insane amount of video data being generated, YouTube hasn't run out of space? Especially with viral hits racking up billions of views.
This is insane, right? Imagine a platform bursting with millions of videos, yet never facing a space crunch. And even if you try to counter with "cloud computing", at the end of the day the cloud is just physical hardware, hard disks sitting somewhere in a data center :)
From Petabytes to Exabytes:
YouTube operates at an unprecedented scale, storing petabytes and exabytes of video content to cater to its vast user base. To put this into perspective, a single petabyte is equivalent to one million gigabytes, while an exabyte is one billion gigabytes. Managing such immense volumes of data is insane🤯.
*So, the questions arise:*
- What's the limit? 🤔
- How do they never lose anything?
- How can any data be accessed instantly from anywhere in the world?
Let's delve into the deeper, more fascinating story behind YouTube's seemingly infinite storage capabilities.
And don't worry, I'm not gonna fob you off with "cloud computing" as the answer XD
Beyond the Cloud
Well, storage might have seemed manageable back when the maximum quality was 720p, but now many videos need to be stored in 4K. They must have developed special compression algorithms or methods to minimize file sizes.
If they were to rely solely on buying more storage, it would require enormous space and be costly for a company of any size, especially considering that anyone can upload vast amounts of data for free.
First take: Compression Magic
The only reasonable explanation involves data compression. Videos are compressed before storage using cutting-edge codecs like VP9, H.264, H.265 (HEVC), and AV1. This can reduce file sizes by up to 50%, significantly stretching storage capacity without noticeably compromising quality.
Ideally, this would be done without compromising quality at all. In practice, though, these codecs are lossy: no matter how effective the compression is, some detail is sacrificed in exchange for smaller files and faster delivery.
This does sound like Pied Piper's revolutionary compression algorithm from the series *Silicon Valley* XD
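To make this concrete, here's a minimal sketch of what a single transcode step could look like, using the open-source ffmpeg tool from Python. The file names and CRF value are made up for illustration, and this is in no way YouTube's actual pipeline; the flags themselves are standard ffmpeg options for VP9's constant-quality mode.

```python
import subprocess

# Hypothetical single transcode: re-encode an upload to VP9 in
# constant-quality (CRF) mode. File names and the CRF value are
# invented; this illustrates the idea, not YouTube's real pipeline.
def transcode_to_vp9(src: str, dst: str, crf: int = 32) -> None:
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            "-c:v", "libvpx-vp9",           # VP9 video codec
            "-b:v", "0", "-crf", str(crf),  # constant-quality mode
            "-c:a", "libopus",              # Opus audio codec
            dst,
        ],
        check=True,  # raise if ffmpeg exits with an error
    )

transcode_to_vp9("upload.mp4", "upload_vp9.webm")
```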
In addition, YouTube utilizes advanced transcoding and optimization techniques to encode uploaded videos into multiple formats and resolutions, catering to various devices and network conditions. Adaptive bitrate streaming further enhances the user experience by dynamically adjusting video quality based on available bandwidth and device capabilities.
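As a rough sketch of the adaptive part: the player keeps a ladder of renditions and picks the highest one the current bandwidth can sustain. The ladder below is invented for illustration; real players (DASH/HLS) also smooth bandwidth estimates and factor in buffer health.

```python
# Example rendition ladder: (label, bitrate needed in kbps).
# These numbers are illustrative guesses, not YouTube's actual ladder.
RENDITIONS = [
    ("2160p", 20000),
    ("1080p", 5000),
    ("720p", 2500),
    ("480p", 1000),
    ("240p", 400),
]

def pick_rendition(measured_kbps: float) -> str:
    """Return the highest rendition the measured bandwidth can sustain."""
    for label, required in RENDITIONS:  # ordered highest to lowest
        if measured_kbps >= required:
            return label
    return RENDITIONS[-1][0]  # fall back to the lowest rung

print(pick_rendition(3200))  # -> "720p"
```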
Second take: Storage Tiers
Tiered storage is one of the main factors: videos aren't stored in one monolithic cloud. YouTube employs a tiered system, where frequently accessed content resides in high-performance, readily accessible storage (think lightning-fast SSDs), while less-viewed videos migrate to colder, more cost-effective tiers (like hard drives). This optimizes latency, performance, and storage costs.
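A toy model of what such placement logic might look like (the tier names and thresholds are pure invention; YouTube's actual policy isn't public):

```python
# Route each video to a storage tier based on recent demand.
# Thresholds and tier names are made up for illustration.
def assign_tier(views_last_30_days: int) -> str:
    if views_last_30_days > 100_000:
        return "hot"   # SSD-backed, lowest latency
    if views_last_30_days > 1_000:
        return "warm"  # standard hard drives
    return "cold"      # cheaper, slower archival storage

for views in (250_000, 5_000, 12):
    print(views, "->", assign_tier(views))
# 250000 -> hot, 5000 -> warm, 12 -> cold
```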
Third take: Content Lifecycle Management
Content Assessment: YouTube constantly analyzes videos to understand their popularity and engagement. Videos with low viewership or engagement are flagged for archival or removal, freeing up space for fresh content.
(But still, there are tons of inactive accounts with all their old videos intact.)
Partner Programs: YouTube offers monetization options for creators. Videos enrolled in such programs are typically retained longer due to their potential revenue generation.
Technology Advancements:
Emerging Technologies: YouTube actively explores cutting-edge technologies like DNA storage, which offers exponentially denser storage compared to traditional methods. While still in its early stages, it holds vast potential for the future.
Moore's Law (loosely applied to storage): capacity per dollar consistently increases, driven by advancements in hardware technology. This allows YouTube to accommodate growing video libraries while maintaining cost-effectiveness.
What about availability?
Well, if we're talking just about the availability of this huge amount of data, it comes down to:
- Global Network: YouTube's storage infrastructure isn't confined to a single location. It's distributed across data centers worldwide, ensuring redundancy and resilience. If one data center experiences an outage, others can seamlessly take over, preventing service interruptions.
- Content Replication: Popular content is replicated across different data centers. This ensures a copy is readily available close to viewers, minimizing latency and buffering issues.
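Here's a minimal sketch of the replica-selection idea: serve each viewer from the closest data center holding a copy. The locations and the crude distance metric are invented; real systems route via anycast, DNS, and live load data rather than anything this naive.

```python
import math

# Invented data-center coordinates (lat, lon) -- purely illustrative.
DATACENTERS = {
    "us-east": (39.0, -77.5),
    "eu-west": (53.3, -6.3),
    "asia-se": (1.35, 103.8),
}

def nearest_replica(viewer: tuple[float, float], replicas: list[str]) -> str:
    """Pick the replica closest to the viewer (crude Euclidean distance
    on lat/lon -- fine for a sketch, wrong for production routing)."""
    return min(replicas, key=lambda dc: math.dist(viewer, DATACENTERS[dc]))

# A viewer in Berlin, for a video replicated in all three regions:
print(nearest_replica((52.5, 13.4), ["us-east", "eu-west", "asia-se"]))
# -> "eu-west"
```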
What information is publicly available?
Reportedly, Google uses the Google File System (GFS) and BigTable to manage this huge amount of data. They have millions of disks in a RAID configuration across multiple data centers. I also found an answer on Twitter from 'TechWelthEngine' that sounds plausible.
"At 4.3 petabytes a day, it takes just over 232 days to get to an exabyte. If we assume that they have 15 EB of storage, then that means it'll take them 9.5 years to fill it all at this pace."
But if this is true, do they have to build a new 15 EB facility every 9.5 years?
I am not really sure. Maybe they just dedupe any redundant data?
And don't forget that the 4.3 petabytes a day will only increase over the coming years, especially with a huge number of videos now being created and narrated by AI!
And if they really are just constantly upgrading their servers (which obviously they are not), then it explains why we have to watch 2 ads, then 1.5 minutes of the actual video, then 2 ads, then 3 minutes, and the process repeats :)
So I believe there must be more to it, because they can't keep building server farms forever and ever...
I tried to contact YouTube and some senior developers there to get a clearer view on this, but so far there has been no response.
Hence, the questions remain unanswered: just how long can YouTube hold onto our data in the cloud, and what are YouTube's archival processes?
What do you think about it? Do let me know in the comments.
Inspired by Twitter/X conversations with Ben Weddle.
If you enjoyed this blog, you can follow me for more posts like this.
If you'd like to support me, you can sponsor me on GitHub or buy me a coffee.
Top comments (8)
Google does not use a RAID configuration.
Google uses horizontal scaling.
In horizontal scaling, the complexity of adding data is O(1) and the complexity of removing data is O(1), but editing is more complicated. If you've noticed, in many big services you cannot edit; you have to delete the data and add it again.
Appreciate the insight! Horizontal scaling definitely offers advantages in terms of scalability and complexity.
As for editing data, yes, I guess it can sometimes be more complex due to the distributed nature of horizontal scaling. It's fascinating to see how different approaches like this contribute to the architecture of large-scale services.
They also compress old videos that people are not watching anymore.
Some videos used to be available at 1080p; now they top out at 240p.
So even if one server farm is full, re-compression frees up some room again over months or years.
This does make sense! But does this mean that in the future we won't be able to watch today's 4K videos at their original quality? :)
However, the videos I uploaded to YouTube 6-7 years ago are still the same quality.
YouTube's compression does lose some quality, like you said. This video makes it visible to the human eye and talks about why they think this happens and how the lossless compression parts work. youtube.com/watch?v=JR4KHfqw-oE
Thanks for sharing the video! It's a great explanation of how compression on YouTube happens, and it helps make sense of how YouTube manages its vast library of videos. Appreciate the insight!
I found this insightful. You've been able to point out plausible answers to the question. I believe there remain other technologies, implemented alongside your highlighted solutions, that Google is using to keep up with the vast amount of data generated on YouTube.
Thank you for your feedback! I'm glad you found the insights helpful. It's quite possible that Google has implemented various other innovative technologies and strategies to efficiently handle the ever-growing data on their platform, but the question is: will they ever let us know about it?