The traditional file system-and-database web backend is no longer adequate, and must make way for storage systems that manage unstructured data. In this article we will learn about the differences between structured and unstructured data, and why web storage backends must evolve to manage unstructured data.
Traditionally, web applications use file systems and databases to store user data. This is simple to manage, as web applications generate structured data by accepting text input in forms, and saving the input to a database. However, times are changing; with the advent of social media, cloud storage, and data analytics platforms, increasing quantities of unstructured data are being pushed onto the Internet.
IDC conducted a study in 2014 that predicted the unstructured data created and copied all over the world will reach 44 zettabytes, i.e. 44 trillion gigabytes, annually by 2020. This is a 10x increase from the 2013 figure of 4.4 zettabytes. If you are thinking this is a little too much, think about this: unstructured data already accounted for 90% of all digital data in 2015!
So, as with other computing paradigms, storage systems need to evolve to manage this new wave of unstructured data that has hit the Internet. But before we move any further, let me define unstructured data for you. Data that can’t be organized for storage inside a relational database are generally termed unstructured data. Unstructured data can be textual or non-textual. Text documents, emails, and presentations are examples of textual unstructured data. Examples of non-textual unstructured data include videos, images, and audio files. You can also take a look at this Quora thread to get an idea about the difference between structured and unstructured data.
Why object storage?
We now know that there is a lot of unstructured data being generated, and it needs to be handled in an easy-to-access, yet secure and reliable way. We already have a storage mechanism that people have been using since the start of modern computing, the file system.
So why do we need a whole new storage paradigm? The answer lies in the details. Let us close in a little and understand the requirements.
When we talk about unstructured data and its scale, it is important to understand that the underlying system used to store data should scale very well. But scaling file systems is difficult. Not only do you need to manage the (sometimes) unnecessary metadata and hierarchy that file systems impose on you, but there are also maintenance considerations such as backup management.
It is not enough to just collect unstructured data. You also need to apply some level of organization to make sense of the data. Techniques like text analytics, auto-categorization, and auto-tagging are crucial to get business sense from all the unstructured data that you collect. This is difficult to achieve with file systems because they have fixed layouts.
File systems aren’t made for HTTP(S), but rather for humans. Sharing and managing files in a file system is difficult to handle programmatically. Handling file streams and the possible boundary cases is error-prone, and takes a lot of time and effort.
To bypass all this, something new is needed, something imagined from scratch that keeps the new requirements in focus. This leads us to object storage.
What is object storage?
Unlike files in file systems, objects are stored in a flat structure. There is just a pool of objects: no folders, directories, or hierarchies. You simply ask for a given object by presenting its object ID. Objects may be local, or on a cloud server thousands of miles away, but since they are in a flat address space, they are retrieved in exactly the same way.
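The flat address space described above can be sketched in a few lines of Python. This is only an illustration, not how any particular product is implemented: the class name FlatObjectStore and its put and get methods are hypothetical, and the pool lives in memory rather than on disk or in the cloud.

```python
import uuid


class FlatObjectStore:
    """Hypothetical in-memory object pool: no folders, only object IDs."""

    def __init__(self):
        # One flat mapping from object ID to raw bytes.
        self._pool = {}

    def put(self, data: bytes) -> str:
        """Store the bytes and return a newly generated object ID."""
        object_id = uuid.uuid4().hex
        self._pool[object_id] = data
        return object_id

    def get(self, object_id: str) -> bytes:
        """Retrieve an object by presenting its ID; raises KeyError if unknown."""
        return self._pool[object_id]


if __name__ == "__main__":
    store = FlatObjectStore()
    video_id = store.put(b"...video bytes...")
    note_id = store.put(b"meeting notes")
    # Both objects sit side by side in the same pool and are fetched the same way.
    print(video_id, store.get(video_id))
    print(note_id, store.get(note_id))
```

A caller never needs to know where an object sits in a hierarchy, because there is none. The ID alone is enough to retrieve it.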
An important aspect is metadata handling. Object storage provides a great deal of flexibility, because object metadata are arbitrary. Metadata are not limited to what the storage system thinks is important (think of fixed metadata in file systems). You can manually add any type or amount of metadata. For example, you can assign the type of application the object is associated with; the importance of an application; the level of data protection you want to assign to an object; whether you want this object replicated to another site or sites; when to move this object to a different tier of storage or to a different geography; when to delete this object. The possibilities are practically limitless.
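To make the idea of arbitrary metadata concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the StoredObject class, the find_by_metadata helper, and metadata keys such as "application", "protection", and "replicate_to" are hypothetical names, not part of any real object storage API.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """An object is its data plus whatever metadata the owner chooses to attach."""
    data: bytes
    metadata: dict = field(default_factory=dict)


def find_by_metadata(pool, **criteria):
    """Return the sorted IDs of objects whose metadata match every given key/value."""
    return sorted(
        object_id
        for object_id, obj in pool.items()
        if all(obj.metadata.get(key) == value for key, value in criteria.items())
    )


def example_pool():
    """Build a small pool with free-form metadata (all keys are made up)."""
    return {
        "obj-1": StoredObject(b"invoice scan", {
            "application": "billing",
            "protection": "high",
            "replicate_to": ["site-b"],
        }),
        "obj-2": StoredObject(b"cat video", {
            "application": "social",
            "protection": "low",
            "delete_after_days": 30,
        }),
        "obj-3": StoredObject(b"contract", {
            "application": "billing",
            "protection": "high",
        }),
    }


if __name__ == "__main__":
    pool = example_pool()
    print(find_by_metadata(pool, application="billing", protection="high"))
```

Because the metadata are just key/value pairs chosen by the owner, a policy such as "replicate everything marked high protection" becomes a simple query over the pool rather than a convention encoded in directory names.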
Files must be accessible via HTTP(S) so that they can easily be retrieved and subjected to analytics or other techniques. Object storage handles this well. Almost all platforms offering object storage have REST APIs to help you access files via HTTP(S). Not only are the APIs helpful in accessing data, they also help you authenticate, get file properties, and manage permissions, all of which you would need to do manually in a file system.
Now that the majority of data on the Internet are unstructured, and pundits are predicting double-digit growth in this trend, it is important to take this challenge head-on. Unstructured data must be stored in an easy-to-access manner, and we must have the tools to make business sense of the vast quantities of unstructured data we are collecting.
Let us take a look at a few of the most popular open source object storage solutions available:
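As a rough sketch of what HTTP access to objects looks like, the following Python example starts a toy in-memory server from the standard library and talks to it with PUT, HEAD, and GET requests. It is not the API of any real platform: the ObjectHandler class, the OBJECTS pool, and the object ID "photo-123" are hypothetical, the X-Object-Meta- header prefix is borrowed from the convention OpenStack Swift uses for custom object metadata, and authentication and permissions are left out entirely.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory pool: object ID -> (bytes, metadata headers).
OBJECTS = {}


class ObjectHandler(BaseHTTPRequestHandler):
    """Toy endpoint: PUT stores an object, GET returns it, HEAD returns properties."""

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        # Keep every X-Object-Meta-* request header as user metadata.
        meta = {k: v for k, v in self.headers.items()
                if k.lower().startswith("x-object-meta-")}
        OBJECTS[self.path.lstrip("/")] = (body, meta)
        self.send_response(201)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def _respond(self, include_body):
        entry = OBJECTS.get(self.path.lstrip("/"))
        if entry is None:
            self.send_response(404)
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        body, meta = entry
        self.send_response(200)
        for key, value in meta.items():
            self.send_header(key, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if include_body:
            self.wfile.write(body)

    def do_GET(self):
        self._respond(include_body=True)

    def do_HEAD(self):
        self._respond(include_body=False)

    def log_message(self, *args):
        pass  # silence per-request logging


def demo():
    """Store an object over HTTP, read its properties, then fetch its content."""
    server = HTTPServer(("127.0.0.1", 0), ObjectHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/photo-123"
    # Ignore any proxy settings so the request reaches the local toy server.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        put = urllib.request.Request(url, data=b"raw image bytes", method="PUT",
                                     headers={"X-Object-Meta-Owner": "alice"})
        with opener.open(put) as resp:
            status = resp.status
        with opener.open(urllib.request.Request(url, method="HEAD")) as resp:
            owner = resp.headers["X-Object-Meta-Owner"]
        with opener.open(url) as resp:
            body = resp.read()
    finally:
        server.shutdown()
        server.server_close()
    return status, owner, body


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that storing, inspecting, and retrieving an object are all plain HTTP requests against the object's ID, so any language or tool that speaks HTTP can work with the store.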
Ceph is a distributed object, block, and file storage platform. Ceph’s software libraries provide client applications with direct access to the RADOS object-based storage system, and also provide a foundation for some of Ceph’s advanced features, including RADOS Block Device (RBD), RADOS Gateway, and the Ceph File System. (See An introduction into Ceph storage for OpenStack.)
OpenStack Swift is a highly available, distributed, eventually consistent object/blob store. Written in Python, Swift supports REST APIs and other clients to access data. (Read more Opensource.com articles about Swift.)
A version of this article was previously posted at Minio Blog. Reposted with permission and under Creative Commons.