My GSoC 2017 Experience with Apache OODT
This post is regarding my GSoC 2017 project, Reworking OODT configuration management to make use of Apache Zookeeper. An introduction to my project, how I came up with a design and the implementation along with the problems faced are described here. Also the future developments/improvements possible are described at the end of this post as well.
What is Apache OODT?
Having originated at NASA Jet Propulsion Laboratory, Apache OODT (Object Oriented Data Technology) consists of a collection of components which can be used for information integration and data processing. These components include,
- A File Manager which is responsible for file management, meta data management and data movement.
- A Work Flow Manager which is capable of handling data flow and control flow.
- A Resource Manager which is responsible for job assignment to nodes and monitoring.
Apache OODT also includes an automated crawling framework, a remote data acquisition system, and a science algorithm “wrapper” called CAS-PGE (Catalog and Archive Service Production Generation Executive) that makes it easy to run algorithms in a reusable, repeatable, and traceable fashion.
My Project: [OODT-945] Distributed Configuration Management for OODT
Mentors
Tom Barber
- Tom is the current head of OODT, a NASA developer, Director at Spicule and the creator of Saiku Analytics.
- Tom is the currently the head of Apache OODT project and he is the one who came up with the requirement of this project.
Chris Mattman
- Chris is a Principal Data Scientist and the Chief Architect in the Instrument and Data Systems section, at the Jet Propulsion Laboratory (JPL) in Pasadena, California.
- Chris was introduced to me later by Tom and eventually became an unofficial mentor of my project. He helped me to test and refine my implementation.
Overview
All OODT components mentioned above have their own configuration. These configuration are usually stored in XML files or properties files. When setting up clusters of these components (say file manager), users have to put more and more effort into copying configuration into servers and modifying configuration files to match the requirement. This is a tiresome task and makes it complex to manage. Additionally, this complexity creates problems when the platform is distributed across multiple servers or locations.
Objective
Main objective of this project was to introduce distributed configuration management to major OODT components; file manager, resource manager and workflow manager using Apache zookeeper as the distributed storage with the objective of reducing the above mentioned complexity. Corresponding JIRA issue (OODT-945) gives further information on the requirement.
Deliverables
Following are the deliverables mentioned to be delivered by the end of the project period (see project proposal) and I’m happy to say that I was able to complete all of them as expected.
Completed distributed configuration management module
- Newly created config module available at https://github.com/apache/oodt/tree/feature/zookeeper-config/config
- All work done is available in feature/zookeeper-config branch (of Apache OODT main repository) at https://github.com/apache/oodt/tree/feature/zookeeper-config.
Unit tests and integration tests for the new module
- Unit tests to test the functionality of configuration publishing mechanism and configuration loading mechanism available at https://github.com/apache/oodt/tree/feature/zookeeper-config/config/src/test.
- Integration tests with simulated OODT component behaviors
- For File Manager, Resource Manager and Workflow Manager.
Documentation on “How to use the distributed configuration management module”
- Available at cwiki.
A developer documentation
- Available at cwiki.
Design and Implementation
Apache Zookeeper was used as the distributed storage based on several reasons.
- Zookeeper has been out there for some time and is stable. It guarantees Consistency and Partition Tolerance.
- In addition to a distributed storage, zookeeper is a complete distributed coordination scheme. Features like zookeeper watches could be used to notify configuration changes.
- Zookeeper is also from the Apache ecosystem.
Based on Zookeeper’s features, following design was suggested for the distributed configuration management implementation.
Design and The Idea
A ZNode structure was designed to address our requirement where there can be multiple projects with the same set of OODT components, but running with different configuration. All the design related information is documented in confluence wiki.
On a high level, there can be multiple projects running different sets of file managers, workflow managers and etc. To store configuration for different projects separately in zookeeper, the concept of projects are introduced and the root ZNode is divided to sub trees based on the project. Under each project there are configuration stored for different OODT components (file manager, resource manager, …). Basically, there are two types of configuration files that needs to be stored, properties files and other configuration files (like XML files and etc). Out of those, properties within the properties files are only loaded once (normally at the initialization). Therefore, they are treated as an special case in contrast to other configuration files. In order to achieve this, two sub trees were created for properties files and other configuration files for each component.
Read more on design to understand how this design works at runtime.
Implementation
A new module was created named config to accommodate the distributed configuration management implementation code. Then the implemented configuration management code was integrated into file manager, resource manager and workflow manager. As a result, 3 pull requests were sent and accepted.
Since distributed configuration management is optional, users should be allowed to select one of file based (current) configuration manager and distributed configuration manager. How this selection is done through environment variables and what are the steps needed to be taken to enable distributed configuration management is documented at Using Distributed Configuration Management CWIKI section. Also “Distributed Configuration Publishing” section explains how the configuration should initially be stored using the configuration publishing CLI.
Pull Requests
- Initial implementation of distributed configuration management — https://github.com/apache/oodt/pull/44
- CLI for configuration publishing — https://github.com/apache/oodt/pull/48
- Integrated distributed configuration management to workflow and resource managers — https://github.com/apache/oodt/pull/50
All these pull requests are merged to feature/zookeeper-config branch of the main repository.
Features Implemented
Distributed Configuration Manager capable of,
- Downloading and storing configuration published for each component under corresponding project.
- Listening for configuration changes, downloading new configuration, deleting existing configuration according to the notification and notifying Configuration Listeners about the configuration change occurred so that the interested parties can take necessary actions.
Distributed Configuration Publisher capable of,
- Publishing configuration to zookeeper.
- Verifying the configuration published to zookeeper are identical to what the original configuration files contains.
- Clearing configuration from zookeeper.
- Notifying the listening parties about the configuration changes.
A Configuration Publishing CLI capable of,
- Exposing all the functionalities provided by the Distributed Configuration Publisher.
- Has options to notify configuration listeners about the configuration change.
Challenges faced
Coming up with a design for Zookeeper ZNode structure
- This was the most critical part of this project. Without a good and expandable design at zookeeper level, moving forward was not feasible. Therefore, the initial design suggested by me in the proposal needed to be changed multiple times. For example in the earlier stages, this structure didn’t have the concept of projects.
- Tom and Chris helped me to identify areas where the design needed to be improved. After discussing further with Tom, we could agree on the current state which can be considered as a good design.
Implementing a convenient mechanism to publish configuration
- With distributed configuration management, a convenient way to publish configuration needed to be introduced. With few development iteration, I could come up with a CLI to publish configuration.
Testing the distributed configuration management implementation
- First I implemented simple unit tests to test the functionality with an inbuilt testing zookeeper cluster.
- Doing the integration tests and manually verifying the functionality was the hardest task out of all. Initially I tried running these OODT components from bash scripts with a separate zookeeper server up and running. But this wasn’t enough to be full confident about the implementation. At this point, Chris and Tom suggested me to run DRAT(Distributed Release Audit Tool) since it internally use OODT file manager, workflow manager and resource manager. Through this method I could verify that my implementation is 100% functional in real world.
Future Development
Extending distributed configuration management to a distributed command framework
At the moment, even with distributed configuration enabled:
- We have to login to a remote server
- Install/unpack corresponding OODT component
- Start it (with no manual configuration since configuration is downloaded on the fly). We need to set ZK_CONNECT_STRING environment variable prior to that.
- If we need to restart a component, then we have to login to that server as well.
If we can extend our zookeeper based configuration management to a command framework, we can simply restart/refresh the entire component or the configuration as required with just a simple terminal command in a local machine.
Introducing distributed configuration management to crawler and pcs packages
As per the moment, distributed configuration management only support 3 main components of OODT, file manager, resource manager and workflow manager. It would be great if this feature was introduced to above mentioned packages as well.
Allow file manager clients to query multiple file managers as one
Currently for file storage and data archiving there would have to be an NFS mount and stuff. Once file managers are configured, they are not aware of the other file managers operate in the cluster. If we can allow the file managers to know about each other, then we can extend that to clients being able to query a range of file managers as if they were one.
Thanks Tom, Chris, OODT Team and The ASF!
This was the first time I participated in GSoC. Even though I had some previous experiences working with open source communities, I could never get this much of exposure.
Tom was a great mentor who did not push me saying do this and do that. Instead, he let me come up with a design and when I came up with a one, he reviewed it and asked me to adjust wherever necessary. And the most important thing is that, both of us support Arsenal :D
Chris was always willing to share his knowledge and was very helpful providing quick reviews on my pull requests and merging them. He also helped me when I ran into problems while developing.
Finally, I would like to thank the OODT team and the ASF as a whole for every piece of advice and support given through the mailing list. Without these people, my mentors and OODT team, I wouldn’t have been able to complete this project and I hope that I was able to give something back to you. You will be seeing me for quite some time as I’m planning to contribute in future as well.
Thank you Google!
Yes, without google I wouldn’t get this opportunity to meet these amazing people and get this amazing experience by contributing to OODT. Thank you Google! Keep it up!