I wrote about Facebook’s ‘Big Data Pile’ a few weeks ago already, but at the end of last week Facebook’s VP of Engineering Jay Parikh showed some invited guests at Facebook HQ just how big this data pile actually is. And no surprise here: it is getting bigger. Fast. Big data means business for Facebook; it’s what provides insights. It enables the social network to understand user sentiment and modify designs accordingly in near real time, for instance. It also benefits advertisers, because Facebook can perform in-depth analysis of how ads are running across the platform and where they are most successful. But just how big is this pile of data? Over at TechCrunch a picture was posted showing some impressive numbers:
- 2.5 billion content items shared per day (status updates + wall posts + photos + videos + comments)
- 2.7 billion Likes per day
- 300 million photos uploaded per day
- 100+ petabytes of disk space in one of FB’s largest Hadoop (HDFS) clusters
- 105 terabytes of data scanned via Hive (the SQL-like data warehouse layer Facebook built on top of Hadoop) every 30 minutes
- 70,000 queries executed on these databases per day
- 500+ terabytes of new data ingested into the databases every day
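To put the scan and ingest figures into perspective, here is a quick back-of-the-envelope calculation. It is only a rough sketch in Python using decimal units; the input numbers come straight from the slide, but the derived rates are my own arithmetic, not figures Facebook published.

```python
# Rough throughput implied by the published figures (decimal units assumed).
TB = 10**12  # terabyte in bytes
PB = 10**15  # petabyte in bytes

scanned_per_half_hour = 105 * TB   # Hive scan volume per 30 minutes
ingested_per_day = 500 * TB        # new data ingested per day
seconds_per_day = 24 * 60 * 60

# 48 half-hour windows in a day
hive_scan_per_day = scanned_per_half_hour * 48
print(f"Hive scan volume: {hive_scan_per_day / PB:.1f} PB/day")        # ~5.0 PB/day
print(f"Average ingest rate: {ingested_per_day / seconds_per_day / 10**9:.1f} GB/s")  # ~5.8 GB/s
```

In other words, if the slide's numbers hold, Hive is churning through roughly 5 petabytes a day, and the ingest pipeline averages close to 6 gigabytes every second around the clock.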
They also told attendees that log files keep track of who accesses all this data, and that only developers working on new products are granted access in the first place. Facebook has also created an intensive training process around acceptable use of user data and maintains a zero-tolerance policy: sniffing around in data you don’t have permission for gets you fired.
For more coverage of the event, check out the post on TechCrunch for info on Project Prism and this picture that shows the life of data on Facebook.