The Internet of Things (IoT) is changing how we generate data and how we use it to gain insights, but the challenge remains in how we manage this diverse data and store it in a meaningful way.
In the early days of big data, we were forced to rely on low-level tools like Flume, Kafka, Storm or even Spark to load data. These products all helped, but they were difficult to manage, and in many cases we had to build a framework around them to make them more efficient. There was a need for faster ways to ingest any type of data, and for tools similar to traditional extract, transform and load (ETL) products, offering a good user interface along with the flexibility to create and manage data loading processes. To that end, we are seeing an evolution in the Hadoop ecosystem, specifically in the area of data ingestion.
Hadoop ecosystem projects like Gobblin and Hortonworks DataFlow (HDF) are emerging as leading options for data ingestion. As data loading for big data becomes more complex, we need these kinds of facilities to load data quickly into the Hadoop Distributed File System (HDFS) with little development effort. They can be used to streamline loading, reduce time to analysis and support an Agile approach to data. Both tools are interesting and worth considering. Gobblin, originally developed at LinkedIn, is open source and not as refined as HDF. It allowed LinkedIn to load the massive amounts of data its website and tools were generating daily; LinkedIn then open-sourced the technology, allowing all of us to use and contribute to its development.
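To give a sense of how little development effort this can involve, a Gobblin job is defined by a small property file rather than custom loading code. The sketch below is modeled on Gobblin's bundled Wikipedia example; the class names and keys shown are illustrative of Gobblin's configuration style, not a production recommendation.

```properties
# A minimal Gobblin job definition (a sketch based on Gobblin's bundled example).
# Gobblin reads this file, runs the named source and converter, and publishes the
# resulting records to HDFS.
job.name=PullFromWikipedia
job.group=Wikipedia

# The source class knows how to extract records from the upstream system.
source.class=gobblin.example.wikipedia.WikipediaSource
converter.classes=gobblin.example.wikipedia.WikipediaConverter

# Where and in what format the extracted records are written.
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=gobblin.publisher.BaseDataPublisher
```

Running the job is then a matter of pointing Gobblin's launcher at this file; adding a new data feed mostly means writing another property file rather than building another loading framework.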
HDF, on the other hand, is a new product from Hortonworks that helps not only with loading data but also with data governance. It is based on the open-source project Apache NiFi, which was originally developed at the United States National Security Agency (NSA) and has since been further enhanced by Hortonworks. With it you can plan and execute loads from any data source: whether your data arrives from a file, a database or a stream, it can all be loaded through a single interface. That flexibility should feel familiar to those who cut their teeth on ETL tools. Tools like these address the complexity of loading a diverse set of data inputs and can enhance your organization's ability to analyze data with lower latency.
So, in the world of IoT, we need to be ready to load an unprecedented range of data formats, both old and new. In the end, we must ensure that, despite these diverse needs, we find ways to simplify and shorten the time to analysis.