Submission Number: 165
Submission ID: 3769
Submission UUID: af0b1e88-3357-4b39-943a-c850fd242402
Submission URI: /form/project

Created: Tue, 06/13/2023 - 08:10
Completed: Tue, 06/13/2023 - 08:10
Changed: Fri, 04/19/2024 - 14:12

Remote IP address: 173.59.116.37
Submitted by: Vinayak Mathur
Language: English

Is draft: No
Webform: Project
Project Title High throughput Python pipeline to identify Horizontal Gene Transfer
Program CAREERS
Project Image
Tags bioinformatics (277), biology (515), data-wrangling (6), genomics (537), github (490), python (69), workflow (365)
Status Halted
Project Leader Vinayak Mathur
Email vm7027@cabrini.edu
Mobile Phone 7324214925
Work Phone
Mentor(s) Simon Delattre
Student-facilitator(s) Kendrick Key
Mentee(s)
Project Description This project seeks to further investigate the genetic phenomenon of horizontal gene transfer (HGT), specifically in interactions between bacteriophages and their host bacteria. From a biological perspective, this type of horizontal gene transfer can occur when a bacteriophage attaches to a bacterial cell and injects its genetic material, which may integrate into the host genome and take control of the bacterium to produce copies of the phage. The main aim of the project is to develop an analysis pipeline, written in Python, that automatically generates a list of bacterial accession numbers from an input list of phage accession numbers. The current program uses BLAST to create this list.
The pipeline iterates through the input list and submits each phage accession number as a BLAST query against the NCBI database of bacterial genes. The top bacterial result for each phage query ID is stored and, in turn, used as a BLAST query against the database of bacteriophage genes. If the phage result of this second search matches the original phage query ID, the match indicates the presence of horizontal gene transfer. Running this analysis in an HPC environment accessed over SSH could significantly speed up data collection compared with the current pipeline or with manual searches on the NCBI BLAST website.
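
The reciprocal check described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in the CSPpipeline repository: it assumes both BLAST searches have already been parsed into (query, subject, bit score) rows, and the accession IDs and function names are hypothetical.

```python
def top_hits(rows):
    """Return the best-scoring subject for each query.

    rows: iterable of (query_id, subject_id, bitscore) tuples.
    """
    best = {}
    for query, subject, bitscore in rows:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_matches(phage_vs_bacteria, bacteria_vs_phage):
    """Return (phage, bacterium) pairs whose top hits point back at each other."""
    forward = top_hits(phage_vs_bacteria)
    backward = top_hits(bacteria_vs_phage)
    pairs = []
    for phage, bacterium in forward.items():
        # The bacterial top hit must return the original phage as its own top hit.
        if backward.get(bacterium) == phage:
            pairs.append((phage, bacterium))
    return sorted(pairs)


# Hypothetical parsed results (placeholder IDs, not real accessions).
forward_rows = [
    ("PHAGE_A", "BACT_1", 250.0),
    ("PHAGE_A", "BACT_2", 90.0),
    ("PHAGE_B", "BACT_3", 120.0),
]
backward_rows = [
    ("BACT_1", "PHAGE_A", 240.0),
    ("BACT_3", "PHAGE_C", 130.0),
    ("BACT_3", "PHAGE_B", 100.0),
]
candidates = reciprocal_matches(forward_rows, backward_rows)
```

In this toy input only PHAGE_A and BACT_1 select each other as top hits, so only that pair is reported as an HGT candidate.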

Current version of the pipeline is available here: https://github.com/genomesolver/CSPpipeline

Research goals: This research project has three major goals:
1) Identify instances of HGT in a large dataset of bacteriophage proteins: The data list produced by the program facilitates more in-depth analysis of bacteriophage-mediated horizontal gene transfer.
2) Predict likelihood of HGT: By developing a probabilistic classifier, we can attempt to predict the likelihood that a certain clade of bacteria is affected by horizontal gene transfer given the HGT status of the other members of the clade. This model could assist in establishing the statistical significance of the occurrences of HGT in bacterial relatives and help identify cellular features specific to those groups of bacteria that could potentially explain their vulnerability to infection by phages.
3) Functional analysis: A further research aim is to run a Gene Ontology (GO) enrichment analysis to draw meaningful conclusions from this data. Since the current version of the pipeline generates a list of bacterial accession numbers that correspond to phage query IDs, that list can be processed to find GO terms in groups of genes regulated by the integration of the bacteriophage's nucleic acids. This type of analysis would help visualize and improve the understanding of how phage infections disrupt the genetic network of the bacteria.
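
For the third goal, GO enrichment is commonly assessed with a one-sided hypergeometric test. The sketch below shows that calculation using only the standard library. It is an illustration under stated assumptions: the proposal does not say which test would be used, and the function name and gene counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, sample, annotated_in_sample):
    """Probability of seeing at least `annotated_in_sample` annotated genes.

    population: total number of genes in the background set
    annotated: background genes carrying the GO term
    sample: number of genes in the HGT-associated list
    annotated_in_sample: genes in that list carrying the GO term
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(annotated_in_sample, upper + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, 3 carry the term, 2 genes sampled,
# at least 1 of them carries the term.
p_value = hypergeom_enrichment_p(10, 3, 2, 1)
```

A real analysis would repeat this for every GO term and correct the resulting p-values for multiple testing.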
Project Deliverables The goals of the project are:
1) Fine-tune the existing Python pipeline so that it can analyze larger datasets
2) Run the analysis against an offline copy of the NCBI database
3) Develop a model that predicts the likelihood of HGT
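
One simple way to approach the third deliverable is a Beta-Bernoulli estimate: the probability that a clade member shows HGT is estimated from the HGT status of the other members, with a uniform prior as smoothing. This is a hedged sketch of one possible model, not the classifier the project plans to build. The function names, the prior parameters (alpha and beta), and the clade data are all assumptions.

```python
def hgt_probability(other_members_status, alpha=1.0, beta=1.0):
    """Posterior mean probability of HGT given the other clade members.

    other_members_status: list of booleans (True = HGT detected).
    alpha, beta: Beta prior parameters (1, 1 is a uniform prior; assumed here).
    """
    n = len(other_members_status)
    k = sum(1 for status in other_members_status if status)
    return (k + alpha) / (n + alpha + beta)


def leave_one_out(clade_status):
    """Estimate each member's HGT probability from all other members."""
    estimates = {}
    for name in clade_status:
        others = [s for other, s in clade_status.items() if other != name]
        estimates[name] = hgt_probability(others)
    return estimates


# Hypothetical clade with placeholder strain names.
clade = {"strain_A": True, "strain_B": True, "strain_C": False, "strain_D": True}
estimates = leave_one_out(clade)
```

With no information about other members the estimate falls back to the prior mean of 0.5, and it moves toward the observed HGT frequency as the clade grows.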
Student Research Computing Facilitator Profile
Mentee Research Computing Profile
Student Facilitator Programming Skill Level Some hands-on experience
Mentee Programming Skill Level
Project Institution Cabrini University
Project Address 610 King of Prussia Road
IAD 224
Radnor, Pennsylvania. 19087
Anchor Institution CR-Penn State
Preferred Start Date
Start as soon as possible. No
Project Urgency Already behind - 3 - Start date is flexible
Expected Project Duration (in months) 4
Launch Presentation
Launch Presentation Date
Wrap Presentation
Wrap Presentation Date
Project Milestones
  • Milestone Title: Improvement of Current Pipeline
    Milestone Description: Attempt to improve the run time of the current Python pipeline, possibly by downloading the NCBI database needed for comparison to local server storage
    Completion Date Goal: 2023-08-15
  • Milestone Title: Develop Classifier
    Milestone Description: Develop the probabilistic classifier for the pipeline, to predict likelihood of HGT in different bacterial clades
    Completion Date Goal: 2023-09-30
  • Milestone Title: Functional Analysis Function
    Milestone Description: Create a functional-analysis function that compares results against existing Gene Ontology databases. Begin writing a manuscript to publish the results.
    Completion Date Goal: 2023-10-31
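
For the first milestone, a search against a locally stored database could be launched with the BLAST+ command-line tools. The sketch below only assembles the command. It assumes BLAST+ is installed, that a local database has already been built with makeblastdb, and that protein searches (blastp) are appropriate; the proposal does not state which BLAST program the pipeline uses. The file and database names are hypothetical.

```python
def build_blast_command(query_fasta, db_name, out_path, program="blastp",
                        evalue=1e-5, max_targets=5, threads=4):
    """Assemble a BLAST+ command line as an argument list for subprocess.run."""
    return [
        program,
        "-query", query_fasta,          # FASTA file with the query sequences
        "-db", db_name,                 # name of the local BLAST database
        "-out", out_path,               # where the tabular results are written
        "-outfmt", "6 qseqid sseqid bitscore evalue",  # tab-separated columns
        "-evalue", str(evalue),
        "-max_target_seqs", str(max_targets),
        "-num_threads", str(threads),
    ]


# Hypothetical file and database names.
cmd = build_blast_command("phage_proteins.fasta", "bacteria_local", "hits.tsv")

# To actually run the search (requires BLAST+ on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Requesting several targets per query and choosing the top hit afterwards in Python avoids relying on the search returning exactly one best hit.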
Github Contributions https://github.com/genomesolver/CSPpipeline
Planned Portal Contributions (if any)
Planned Publications (if any) Plan to publish the manuscript in the journal: https://iubmb.onlinelibrary.wiley.com/journal/15393429
What will the student learn?
What will the mentee learn?
What will the Cyberteam program learn from this project?
HPC resources needed to complete this project?
Notes
What is the impact on the development of the principal discipline(s) of the project?
What is the impact on other disciplines?
Is there an impact on physical resources that form infrastructure?
Is there an impact on the development of human resources for research computing?
Is there an impact on institutional resources that form infrastructure?
Is there an impact on information resources that form infrastructure?
Is there an impact on technology transfer?
Is there an impact on society beyond science and technology?
Lessons Learned
Overall results