Submission Number: 12
Submission ID: 29
Submission UUID: bb5c1d15-ace8-461c-9866-5aad3c664a08
Submission URI: /form/project

Created: Tue, 09/03/2019 - 11:37
Completed: Tue, 09/03/2019 - 11:41
Changed: Thu, 05/05/2022 - 04:48

Remote IP address: 130.215.55.243
Submitted by: David Oury
Language: English

Is draft: No
Webform: Project
Project Title Creating Tailored Research Computing Environments
Program Northeast
Project Image
Tags big-data (4), containers (55), docker (35), gpu (80), machine-learning (272), parallelization (223), python (69), r (32), resources (16)
Status On Hold
Project Leader David Oury
Email doury@bentley.edu
Mobile Phone 312 880 7092
Work Phone
Mentor(s) David Oury
Student-facilitator(s) Snehal Mattoo, David Reitano
Mentee(s)
Project Description The cloud (public and private) provides an array of virtual machines, available with a range of cores, RAM and specialized hardware. Research faculty at small and medium-size institutions have a variety of requirements for computing and data resources, but need to be efficient in their use of these resources. Three current tools (Terraform, Docker and AWS S3) form an integrated toolset able to provision compute and data resources and to configure compute and data services in a customizable and efficient manner. The primary objective of the Tailored Research Environments (directed study) project is to create a web-based product that enables its user to run R, Python and Spark code in Jupyter notebooks on a user-configured computing environment. Importantly, the user's notebooks (code) and datasets are stored separately from the computing environment. In addition, compute resources can be added to and removed from this environment as needed, which will significantly reduce the overall cost of using the product.

---

The project is a solution to the problem of providing researchers with individually tailored compute resources in a cost-efficient manner. We propose to use an integrated set of technical tools to easily create tailored technical research resources in the cloud (AWS, Google, Microsoft and OpenStack). This integrated toolset is AWS S3, Docker and Terraform.

AWS S3 provides inexpensive, long-term persistent storage for datasets, programs, results and associated reports. Terraform provides the means to easily detail, record and share the specifications and configurations of an array of technical services (primarily virtual machines, but there are others) from the four providers listed above. In addition, these services can be easily created and destroyed, which is the primary reason this solution is cost-efficient. Docker provides an extensive selection of mostly open-source software components that can be run on virtual machines.
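
The cost argument above can be illustrated with a small arithmetic sketch. The hourly rate and the number of hours used are hypothetical figures chosen for illustration, not actual provider prices, and the sketch ignores the (comparatively small) cost of persistent storage.

```python
# Hypothetical figures for illustration only; not actual provider prices.
HOURLY_RATE_USD = 0.50   # assumed on-demand price of one virtual machine per hour
HOURS_PER_MONTH = 730    # average hours in a month (8760 / 12)


def monthly_cost(hours_running: float, hourly_rate: float = HOURLY_RATE_USD) -> float:
    """Cost of a virtual machine that is billed only for the hours it exists."""
    return hours_running * hourly_rate


always_on = monthly_cost(HOURS_PER_MONTH)  # machine left running all month
on_demand = monthly_cost(40)               # machine created for 40 hours, then destroyed
savings = 1 - on_demand / always_on        # fraction of the always-on cost avoided

print(f"always on: ${always_on:.2f}")
print(f"on demand: ${on_demand:.2f}")
print(f"savings:   {savings:.0%}")
```

Under these assumed numbers, most of the monthly cost comes from hours in which the machine sits idle, which is the cost that creating and destroying resources on demand avoids.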

The design of this solution follows the pattern wherein the data and code are persistent, but the use of computing resources is short-lived and impermanent. The latter either reduces the cost of using these resources or facilitates sharing them.
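
One way to picture this pattern is a disposable Jupyter container whose notebooks live outside the container. The sketch below only builds and prints a `docker run` command; it does not run it. The image name `jupyter/pyspark-notebook` (one of the Jupyter Docker Stacks images) and the host directory `/data/notebooks` are assumptions for illustration, not choices stated by the project.

```python
import shlex
from typing import List


def jupyter_run_command(notebook_dir: str,
                        host_port: int = 8888,
                        image: str = "jupyter/pyspark-notebook") -> List[str]:
    """Build a `docker run` command for a throwaway Jupyter container.

    notebook_dir is a hypothetical host directory holding the persistent
    notebooks; it is mounted into the container rather than copied into it.
    """
    return [
        "docker", "run",
        "--rm",                                     # delete the container when it stops
        "-p", f"{host_port}:8888",                  # expose the Jupyter port on the host
        "-v", f"{notebook_dir}:/home/jovyan/work",  # persistent notebooks live on the host
        image,
    ]


cmd = jupyter_run_command("/data/notebooks")
print(shlex.join(cmd))
```

Because the container is started with `--rm`, stopping it discards the compute environment while the mounted notebooks remain on the host.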

Specific objectives of the project are to create a data analysis environment, which:
- is created and managed using "infrastructure as code" techniques
- provides distributed computing to the user
- decouples code, data and compute capacity
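
As a minimal sketch of the "infrastructure as code" objective, the following generates a Terraform configuration in Terraform's JSON syntax (files named `*.tf.json`). The resource name `analysis_node`, the tag value, the instance type and the AMI ID are placeholders introduced for illustration; provider configuration and credentials are omitted.

```python
import json


def build_config(instance_type: str, node_count: int, ami: str) -> dict:
    """Return a Terraform configuration (JSON syntax) for a group of identical VMs."""
    return {
        "resource": {
            "aws_instance": {
                # "analysis_node" is a hypothetical resource name.
                "analysis_node": {
                    "ami": ami,
                    "instance_type": instance_type,
                    # Changing count and re-applying adds or removes machines.
                    "count": node_count,
                    "tags": {"Name": "tailored-research-env"},
                }
            }
        }
    }


# Placeholder values: substitute a real AMI ID and instance type before use.
config = build_config(instance_type="t3.large", node_count=2, ami="ami-PLACEHOLDER")
print(json.dumps(config, indent=2))
```

Saved as, for example, `main.tf.json` alongside a provider block, a configuration like this could be created with `terraform apply` and removed with `terraform destroy`, so the description of the environment is kept in a file that can be recorded and shared.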

The specific objectives of students working on the project are to:
- Create a working product
- Create product documentation
- Develop a working knowledge of infrastructure-as-code techniques
- Develop an understanding of distributed computing techniques and a working knowledge of Spark configuration
- Present a single tutorial and demo of the product to the Data Lab and incorporate feedback into the product and documentation
Project Deliverables The project deliverables are:
- Terraform configuration files for creating customized collections of resources
- Documentation on the use of AWS S3, Docker and Terraform to create these resources
- Use of the above tools to provision specialized hardware from the four cloud providers
Student Research Computing Facilitator Profile The student profile should include:
- Command line skills on Linux
- An understanding of IP addresses and ports
- Ability to debug and investigate in an unfamiliar environment
Mentee Research Computing Profile
Student Facilitator Programming Skill Level Some hands-on experience
Mentee Programming Skill Level
Project Institution Bentley University
Project Address 175 Forest Street
Waltham, Massachusetts 02452
Anchor Institution NE-MGHPCC
Preferred Start Date 09/01/2018
Start as soon as possible. No
Project Urgency Already behind | 3 | Start date is flexible
Expected Project Duration (in months)
Launch Presentation
Launch Presentation Date
Wrap Presentation
Wrap Presentation Date
Project Milestones
Github Contributions
Planned Portal Contributions (if any)
Planned Publications (if any)
What will the student learn? The student will learn:
- To use AWS, Google Cloud Platform, Microsoft Azure and OpenStack
- How to use Terraform to provision from the above providers
- The Linux command line
- How to run, configure and create Docker containers
What will the mentee learn?
What will the Cyberteam program learn from this project? The Cyberteam will learn:
- Another solution to make use of private and public virtual compute environments
- To create tailored environments for researchers
HPC resources needed to complete this project? Two types of resources are needed:
1.) Public cloud virtual machine resources (i.e. funds to create them)
2.) Private cloud virtual machine resources (OpenStack virtualization of possibly specialized hardware)
Notes I would like to work with at least one Bentley student and would be happy to work with up to three non-Bentley students. Having a Bentley student would help me communicate with the group.
What is the impact on the development of the principal discipline(s) of the project?
What is the impact on other disciplines?
Is there an impact on physical resources that form infrastructure?
Is there an impact on the development of human resources for research computing?
Is there an impact on institutional resources that form infrastructure?
Is there an impact on information resources that form infrastructure?
Is there an impact on technology transfer?
Is there an impact on society beyond science and technology?
Lessons Learned
Overall results