Data-centric Cloud Computing

Co-Chairs: James Cuff (FAS Assistant Dean and Distinguished Engineer for Research Computing at Harvard University); Mercè Crosas (Chief Data Science and Technology Officer at the Institute for Quantitative Social Science (IQSS) at Harvard University); and Ignacio M. Llorente (Full Professor in Computer Architecture and Technology, Universidad Complutense de Madrid)

Data is growing faster than ever before: more data has been created in the past two years than in all of prior history. By the year 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet, and our accumulated data will grow from 4.4 zettabytes today to around 44 zettabytes, or 44 trillion gigabytes. The number of devices is also growing quickly: by 2020 there will be over 6.1 billion smartphone users globally and more than 50 billion smart connected devices in the world, all built to collect and share data. Operating on these large volumes of data to extract insights in real time presents new challenges and opportunities for future cloud infrastructures. Addressing these needs requires a redesign of future cloud and data center architectures and of how they interoperate to create distributed environments.

Explosive data growth requires changes in the internal organization of clouds. Most cloud architecture designs have been driven by advances in compute power, and the performance of the I/O subsystem has usually been only a secondary design consideration. As future clouds are increasingly used to capture, process, and archive large volumes of data, we have to rethink cloud architectures to focus on data storage and management. This implies a paradigm shift in data center design from computation to data exploration: infrastructures must become data-centric and minimize data movement by bringing compute to the data.
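
To make the "bring compute to the data" principle concrete, the following minimal Python sketch contrasts a compute-centric workflow that pulls an entire dataset across the network with a data-centric one that ships a small function to the storage site and moves only the results. It is purely illustrative; the StorageSite class and the filter function are hypothetical, not any project's API.

```python
"""Minimal sketch: compute-centric (move data to client) vs. data-centric
(move the function to the data and return only results)."""

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StorageSite:
    """A hypothetical storage site holding a large dataset of records."""
    records: List[dict] = field(default_factory=list)

    def fetch_all(self) -> List[dict]:
        # Compute-centric: the whole dataset crosses the network.
        return list(self.records)

    def run_near_data(self, func: Callable[[List[dict]], List[dict]]) -> List[dict]:
        # Data-centric: the function executes at the site; only results move.
        return func(self.records)


def high_temperature_readings(records: List[dict]) -> List[dict]:
    return [r for r in records if r["temp_c"] > 30.0]


if __name__ == "__main__":
    site = StorageSite(records=[{"id": i, "temp_c": 10.0 + i % 40} for i in range(100_000)])

    # Compute-centric: all 100,000 records are transferred before filtering.
    moved_all = site.fetch_all()
    hot_local = high_temperature_readings(moved_all)

    # Data-centric: only the (much smaller) filtered result is transferred.
    hot_remote = site.run_near_data(high_temperature_readings)

    print(f"records moved (compute-centric): {len(moved_all)}")
    print(f"records moved (data-centric):    {len(hot_remote)}")
    assert hot_local == hot_remote
```

In the data-centric path only the filtered records cross the network, which is the cost and performance argument for co-locating compute with data.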

The NSF-funded New England Storage Exchange (NESE) project is an example of regional collaboration to address these large-data issues. NESE is designing and building a multi-petabyte object store that will be supported by the same university partners that founded the Massachusetts Green High Performance Computing Center (MGHPCC). Another example is Dataverse, a popular framework for sharing, preserving, and analyzing research data, which is being ported to the cloud within the Massachusetts Open Cloud (MOC). MOC is a collaboration between higher education, government, and industry.
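
Research applications typically interact with such an object store through a standard object API. The snippet below is an illustration only: it assumes an S3-compatible interface (common for large object stores) accessed with the boto3 client, and the endpoint URL, bucket name, credentials, and file name are placeholders rather than NESE's or Dataverse's actual configuration.

```python
"""Illustrative object-store access via an S3-compatible API.
All endpoints, credentials, and names below are placeholders."""

import boto3

# Placeholder endpoint and credentials; a real deployment would supply its own.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.edu",
    aws_access_key_id="PLACEHOLDER_KEY",
    aws_secret_access_key="PLACEHOLDER_SECRET",
)

BUCKET = "research-datasets"  # hypothetical bucket name

# Archive a dataset file as an object.
with open("survey_2019.csv", "rb") as f:
    s3.put_object(Bucket=BUCKET, Key="surveys/survey_2019.csv", Body=f)

# Later, retrieve it (or stream it to an analysis job running near the store).
obj = s3.get_object(Bucket=BUCKET, Key="surveys/survey_2019.csv")
data = obj["Body"].read()
print(f"retrieved {len(data)} bytes")
```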

On the other hand, recent trends in cloud computing point towards data-aware distributed clouds as opposed to the current compute-centric, centralized model. Centralized clouds are appropriate for services with limited data communication, such as the Web, or for batch processing, but not for applications that must move large amounts of distributed data or serve interactive users who require low latency and real-time processing. Meeting these latency demands requires bringing resources as close to the devices as physically possible. Examples are the Edge Computing paradigm, which has emerged as a key enabling technology for emerging Internet of Things use cases such as smart cities, and the re-architecting of telco central offices as data centers, which most telecom companies worldwide are undertaking to cope with the explosive growth of mobile data driven by applications such as video streaming, social networking, and gaming.
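
A simple way to picture the latency argument is an orchestrator that probes candidate sites and places a latency-sensitive task on the closest one that meets its budget. The sketch below is illustrative only: the site names, simulated round-trip times, and the 20 ms budget are invented for the example.

```python
"""Minimal sketch of latency-aware placement at the edge: pick the processing
site with the lowest measured round-trip time within a latency budget."""

import random
from typing import Dict

# Hypothetical candidate sites: a central cloud region plus two edge locations.
CANDIDATE_SITES = ["central-cloud", "edge-telco-office", "edge-city-gateway"]


def probe_latency_ms(site: str) -> float:
    """Stand-in for a real RTT probe (e.g., a small ping to each site)."""
    simulated = {"central-cloud": 80.0, "edge-telco-office": 12.0, "edge-city-gateway": 7.0}
    return simulated[site] + random.uniform(-2.0, 2.0)


def choose_site(latency_budget_ms: float) -> str:
    """Choose the lowest-latency site that meets the application's budget."""
    measured: Dict[str, float] = {s: probe_latency_ms(s) for s in CANDIDATE_SITES}
    best_site, best_rtt = min(measured.items(), key=lambda kv: kv[1])
    if best_rtt > latency_budget_ms:
        raise RuntimeError(f"No site meets the {latency_budget_ms} ms budget: {measured}")
    return best_site


if __name__ == "__main__":
    # A real-time video analytics task might tolerate roughly 20 ms of network latency.
    print("placing task on:", choose_site(latency_budget_ms=20.0))
```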

This Study Group will evaluate the main challenges in implementing a data-centric approach to cloud computing, including:

- Rethinking cloud architectures to move compute closer to the data; co-locating compute with data within the architecture can yield significant benefits in cost and performance.

- Building or using distributed clouds to co-locate data processing services across different geographies to provide the required quality of service and functionality.

- Designing novel distributed computing techniques that stretch data center computing facilities to the network edge in order to meet the communication and processing demands of data analytics use cases from the Internet of Things and mobile applications.

- Efficient and agile deployment and management of data-intensive applications with strict geo-distribution, location-awareness, and low-latency requirements in distributed clouds (a placement sketch follows this list).

- Internet-wide storage systems for fault-tolerant, cost-effective, and scalable data gathering, sharing, processing, and dissemination.
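
As a rough illustration of the location-aware placement these challenges call for, the sketch below ranks candidate regions for a data-intensive service by how much of its input data already resides there and by latency to its users. The scoring heuristic, region names, and numbers are invented for the example and are not a proposed design.

```python
"""Illustrative placement heuristic: reward data locality (less data movement),
penalize user latency beyond a budget, and pick the best-scoring region."""

from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    name: str
    local_data_gb: float      # input data already resident in this region
    user_latency_ms: float    # median latency from the service's users


def placement_score(region: Region, total_data_gb: float, latency_budget_ms: float) -> float:
    """Higher is better: reward data locality, penalize latency over budget."""
    locality = region.local_data_gb / total_data_gb
    latency_penalty = max(0.0, region.user_latency_ms - latency_budget_ms) / latency_budget_ms
    return locality - latency_penalty


def choose_region(regions: List[Region], latency_budget_ms: float) -> Region:
    total = sum(r.local_data_gb for r in regions)
    return max(regions, key=lambda r: placement_score(r, total, latency_budget_ms))


if __name__ == "__main__":
    candidates = [
        Region("boston-dc", local_data_gb=800.0, user_latency_ms=15.0),
        Region("west-coast-dc", local_data_gb=150.0, user_latency_ms=70.0),
        Region("edge-site-7", local_data_gb=50.0, user_latency_ms=5.0),
    ]
    best = choose_region(candidates, latency_budget_ms=30.0)
    print("deploy service in:", best.name)
```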