Make Complex Collaborative Big Data Analysis Easy
We live in an era in which “data-driven” approaches are adopted in almost every discipline, from astrophysics to biomedical research and healthcare. In turning data into information and knowledge, scientists and decision makers need to:
- Tap into complex, big, and heterogeneous datasets;
- Use various tools to process and analyze such datasets; and
- Collaborate closely with others for complementary expertise.
The process of extracting information from data is complex and requires adequate information infrastructures for big data, which are composed of several data and computing services. For various reasons, information infrastructures are typically complex, and therefore difficult to build and use:
- They include a large number of heterogeneous, distributed, and evolving software components and services;
- Service interfaces often expose low-level technological details and require configuration and customization for the specific applications; and
- Services are fragmented, requiring manual and complex logistics before they can be used effectively (e.g., converting and transferring files, installing software libraries).
It is therefore necessary to integrate and package these powerful but complex information infrastructures into customized systems that are streamlined for the application, more efficient, and easier to use.
Web-based customized interfaces have been successfully adopted to provide easy access to potentially complex systems. Here we exploit the particular concept of “Science Gateways”, a term coined in 2005 in the US. “Science Gateways” are web-based enterprise information systems that facilitate access to information infrastructures in the form of customized and community-specific interfaces to data collections, computational tools, and collaborative services. In other words, Science Gateways integrate and customize infrastructure services into one system to provide workflow automation, to increase usability and scalability, and to guarantee security and traceability.
The AMC e-Science research group has six years of experience in the research, design, development, and operation of Science Gateways for biomedical research. Five generations of Science Gateway software have been produced, each building on the evaluation of, reflection on, and experience with the previous generations. These gateways are deployed and used by researchers in real use-case scenarios, where they have proven beneficial: researchers can perform complex collaborative big data analysis on infrastructures more easily and more quickly. As a result, gateway users have been able to carry out higher-quality research, as reflected in a higher number of publications in prestigious international journals.
The latest generation is named “Rosemary”. It is a software platform that can be customized programmatically to develop Science Gateways for various domains and applications. Its functions include, but are not limited to:
- Data management: integrate heterogeneous datasets, organize data in well-defined structures, browse datasets, search and filter data based on metadata, import and export datasets to/from various data sources using various formats and protocols.
- Computing management: provide a version-controlled repository of tools (apps), process datasets with these tools on remote high-performance infrastructure, automate logistics such as transporting data and piping them through various tools, search and filter past data-processing runs.
- Collaboration management: fine-grained role-based access to datasets and tools for teams, notification center for latest news within teams, integrated message threads with reference to datasets and data processes.
- Traceability management: track data manipulations in a well-defined structure, enabling traceability (provenance) and reproducibility.
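To make the data and traceability functions above more concrete, the following is a minimal sketch of how datasets, metadata search, and provenance records could fit together. It is not Rosemary's actual data model or API: the names `Dataset`, `Process`, `Catalog`, `run`, `search`, and `lineage`, as well as the example tools and metadata keys, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Dataset:
    """A dataset with free-form metadata (hypothetical model)."""
    id: str
    metadata: Dict[str, str] = field(default_factory=dict)
    produced_by: Optional[str] = None  # id of the Process that created it, if any


@dataclass
class Process:
    """One run of a versioned tool on some input datasets (hypothetical model)."""
    id: str
    tool: str
    tool_version: str
    inputs: List[str]


class Catalog:
    """In-memory stand-in for a gateway back-end; an assumption, not Rosemary's API."""

    def __init__(self) -> None:
        self.datasets: Dict[str, Dataset] = {}
        self.processes: Dict[str, Process] = {}

    def add_dataset(self, dataset: Dataset) -> None:
        self.datasets[dataset.id] = dataset

    def run(self, process_id: str, tool: str, version: str, inputs: List[str],
            output_id: str, metadata: Dict[str, str]) -> Dataset:
        """Record a processing step and register the dataset it produced."""
        for ds_id in inputs:
            if ds_id not in self.datasets:
                raise KeyError(f"unknown input dataset: {ds_id}")
        self.processes[process_id] = Process(process_id, tool, version, list(inputs))
        output = Dataset(output_id, dict(metadata), produced_by=process_id)
        self.add_dataset(output)
        return output

    def search(self, **criteria: str) -> List[Dataset]:
        """Return datasets whose metadata matches every given key/value pair."""
        return [d for d in self.datasets.values()
                if all(d.metadata.get(k) == v for k, v in criteria.items())]

    def lineage(self, dataset_id: str) -> List[str]:
        """Return the 'tool@version' steps that led to a dataset, oldest first."""
        steps: List[str] = []
        seen = set()

        def visit(ds_id: str) -> None:
            proc_id = self.datasets[ds_id].produced_by
            if proc_id is None or proc_id in seen:
                return
            seen.add(proc_id)
            proc = self.processes[proc_id]
            for parent in proc.inputs:
                visit(parent)  # record upstream steps before this one
            steps.append(f"{proc.tool}@{proc.tool_version}")

        visit(dataset_id)
        return steps


# Example: two processing steps applied to one raw dataset.
catalog = Catalog()
catalog.add_dataset(Dataset("raw-1", {"modality": "MRI", "site": "A"}))
catalog.run("p1", "denoise", "1.0", ["raw-1"], "clean-1",
            {"modality": "MRI", "site": "A", "stage": "denoised"})
catalog.run("p2", "segment", "2.3", ["clean-1"], "seg-1",
            {"modality": "MRI", "stage": "segmented"})

print(catalog.lineage("seg-1"))
print([d.id for d in catalog.search(modality="MRI", site="A")])
```

Because every derived dataset stores a reference to the process that produced it, and every process stores its tool version and inputs, the full chain of manipulations can be reconstructed for any result. That is the property traceability and reproducibility depend on.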
The diagram on the right presents a simplified view of the architecture of Rosemary. It is organized into back-end and front-end functions:
- Back-end: provides all mentioned functions through an Application Programming Interface (API), which facilitates integration into different systems. The back-end interfaces with diverse data and computing infrastructures. It can be used independently from Rosemary’s front-end.
- Front-end: a web-based rich internet application that adapts to screen dimensions (e.g., smartphone, tablet, desktop).
Rosemary provides a platform for developing complete big data solutions, from data management to computation and collaboration management. So far, it has been used to develop three Science Gateways for specific domains and applications:
- Computational Neuroscience gateway: complex and multi-site neuroscience data management, computing management on distributed computing platforms (grid), collaboration management among scientists, and traceability management.
- In Vitro Fertilization gateway: complex multi-site IVF data management, data sharing management specifically to follow a workflow that involves authorization from multiple centers for each requested dataset, and traceability management.
- Genomics gateway: complex wet-lab and dry-lab data management, collaboration management among scientists, and data traceability management.
The following screencast presents the Computational Neuroscience gateway based on Rosemary:
Rosemary is open source and free for non-commercial purposes.
An overview of Rosemary can be found in the following paper:
S. Shahand and S. D. Olabarriaga, “Rosemary: A Flexible Programming Framework to Build Science Gateways,” in Proceedings of the 8th International Workshop on Science Gateways (IWSG), Rome, June 2016.
For more information about Rosemary or possible collaboration opportunities, please contact us.