X429midterm

ISTA429/PLS 529 Midterm project for Team X for MCLAS2021




Collaborators:

Introduction

The effects of climate change are a growing concern for people around the world. Agricultural yields in recent years have been drastically affected by rapidly changing climatic conditions, creating a need to improve current agricultural breeding programs. One way this can be achieved is through the use of machine learning algorithms to accurately predict the yields of various plant genotypes. For our midterm project, we used a soybean dataset consisting of 103,365 performance records spanning 13 years and 150 locations, with weekly weather data for each location across a 30-week growing season. We trained various machine learning models to predict yearly crop yield from these records. We found that an ensemble setup, in which we averaged the predictions of multiple weaker models to obtain a single stronger prediction, produced better results.

Getting Started

In order to get started, you must set up a data directory.

First, make sure you have Anaconda on your machine. Next, create a conda environment from the environment.yml file with conda env create -f environment.yml. If needed, activate the environment with conda activate info529midterm. Once the environment is active, you can run the scripts to organize the data.
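The setup steps above can be collected into a short shell session. This is a sketch assuming Anaconda is already installed and you are in the repository root; the logs directory is created here because the training scripts expect log files to be pointed there.

```shell
# Create the conda environment from the repo's environment.yml
conda env create -f environment.yml

# Activate it (environment name comes from environment.yml)
conda activate info529midterm

# Create the directories the scripts expect
mkdir -p data logs
```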

data
├── clusterID_genotype.npy
├── inputs_others_train.npy
├── inputs_weather_train.npy
└── yield_train.npy

These files are necessary for handle_data.py to work; the script will set up all other data files for you.

Final data dir structure:

data
├── clusterID_genotype.npy
├── combined_data_train.npy
├── combined_data_validation.npy
├── combined_weather_mgcluster_214_development.npy
├── inputs_others_train.npy
├── inputs_weather_train.npy
├── scaled_yield_development.npy
├── scaled_yield_train.npy
├── scaled_yield_validation.npy
└── yield_train.npy

As you can see, the script creates both training and validation data for us. The same script, minus the data-splitting step, can also be run to format our test data. The “development” data is the entire training set before it was split.
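As a rough illustration of the splitting and scaling that handle_data.py performs, here is a minimal sketch. The function name, split fraction, and min-max scaling choice are assumptions for illustration, not the script's actual implementation (the real script also merges the weather, genotype-cluster, and other inputs into the combined arrays):

```python
import numpy as np

def split_and_scale(features, yields, val_frac=0.2, seed=0):
    """Shuffle, split into train/validation, and min-max scale the yields.

    Simplified stand-in for what handle_data.py does; the scaling
    statistics come from the full development set before splitting.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(yields))
    n_val = int(len(yields) * val_frac)
    val_idx, train_idx = idx[:n_val], idx[n_val:]

    # Scale yields to [0, 1] using development-set statistics
    y_min, y_max = yields.min(), yields.max()
    scaled = (yields - y_min) / (y_max - y_min)

    return (features[train_idx], scaled[train_idx],
            features[val_idx], scaled[val_idx])

# Tiny synthetic example standing in for the real .npy inputs
features = np.random.rand(100, 10)
yields = np.random.rand(100) * 60  # yield-like values

X_tr, y_tr, X_val, y_val = split_and_scale(features, yields)
# The real script would then np.save() these as combined_data_train.npy,
# scaled_yield_train.npy, and so on under data/.
```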

Running existing model_train scripts

If you have the environment set up on your machine, you can run the script like any other Python script. The same applies if you are in an interactive session on the HPC.

If you are trying to train the model through a slurm script, please run sbatch model_train.script. To see if it’s running, I like to use squeue -u [netID].
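For reference, a SLURM submission script along the lines of model_train.script might look like the following. The resource values and the trained script's filename are placeholders, not the project's actual settings:

```shell
#!/bin/bash
#SBATCH --job-name=model_train
#SBATCH --output=logs/model_train_%j.out
#SBATCH --error=logs/model_train_%j.err
#SBATCH --time=04:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

# Activate the project environment, then launch training
source activate info529midterm
python model_train.py
```

Pointing --output and --error at the logs directory follows the convention mentioned below for custom scripts.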

Interested in writing your own script? That’s okay - just make sure all your log files are pointed to a logs directory.

Documentation / After Action Review

Lessons Learned:

At the start of this project, only two members of our team had much experience building machine learning models. The main learning curve for the rest of us was understanding how the models functioned and which parameters to tweak during testing. Additionally, a few team members faced a steep learning curve in using the HPC and configuring the scripts and Python files to run the code successfully. Overall, this project has boosted our team’s understanding of the start-to-finish process of designing a model, preparing the data, troubleshooting, and making adjustments.

Technologies Learned:

A few members of the team were very proficient in using the HPC and navigating the terminal; the rest of us had a steeper learning curve in this area. We also learned a lot from our team spokesperson, Ryan, about setting up a virtual environment and all the moving pieces needed to get a model up and running on the HPC. Additionally, we learned quite a bit about different models, namely ensemble models, while researching ways to improve our predictions.

Methodologies Learned:

One of the methodologies our team explored was an ensemble model setup, in which we averaged the predictions of multiple weaker models to obtain one stronger prediction.
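The averaging described above can be sketched as follows. This is a minimal illustration with synthetic data and simple least-squares base models trained on bootstrap samples; the project's actual models and data pipeline are not shown here:

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear model; returns a predict function."""
    Xb = np.c_[np.ones(len(X)), X]              # add a bias column
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Xn: np.c_[np.ones(len(Xn)), Xn] @ w

def ensemble_predict(models, X):
    """Average the predictions of several models into one."""
    return np.mean([m(X) for m in models], axis=0)

rng = np.random.default_rng(42)
X = rng.random((200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(0, 0.1, 200)

# Train each "weaker" model on a different bootstrap sample
models = []
for _ in range(5):
    idx = rng.integers(0, len(X), len(X))
    models.append(fit_linear(X[idx], y[idx]))

# One combined prediction from five lesser models
y_hat = ensemble_predict(models, X)
```

Averaging reduces the variance of the individual models, which is the intuition behind why the ensemble outperformed any single model in our experiments.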

10 Rules to Cultivate Transdisciplinary Collaboration in Data Science:

Develop Reflexive Habits: Developing good reflexive habits is integral to working on an interdisciplinary team: it helps members understand each other and stay on the same page. This can be accomplished through regular communication and exposure to other disciplines and the way they operate, so that you can learn the technical language and understand how to communicate and facilitate collaboration. Our team had a fairly wide array of backgrounds, which meant that at the beginning everyone assumed different roles to fill a need. As a result, our team had a good opportunity to develop reflexive habits, since we were all learning from each other throughout the project. Additionally, since we had to assemble a team quickly and work together, we all asked questions early on to get on the same page, so we understood the midterm assignment and our individual roles.

Communicate the project management plan early and often: Having a project management plan early on allows team members to select the tasks they are interested in or experienced in, and to contribute to the project on their own terms. Our plan started by assessing each team member’s strengths and weaknesses and assigning roles to suit them. Those with experience in machine learning and data management were put in the respective management roles, and they distributed tasks to others where necessary. Similarly, the members with knowledge of plant sciences were put in charge of model weights and interpretation.

Speak the same language: One of the challenges in transdisciplinary research is understanding the technical terms used in the different collaborating fields. Our first focus was getting everyone to understand the basic terms used in the fields of the people involved in the project. To achieve this, we maintained a glossary of common terms and encouraged group members to take time to explain any new technical term not in the glossary whenever they used one. We also tried to create a safe environment for everyone to contribute to the project, irrespective of skill level.

Design the project so that everyone benefits: Our team is made up of members with varying data science skills. To get everyone to participate in working on the project, the project was broken down into smaller bits, and responsibilities were assigned based on team members’ skill sets. During all the team meetings, each member gave an update on what they were working on and which tools they were using. This was so that other members of the team could learn a thing or two about other people’s responsibilities.

Fail early and often: Failing early and often is typical in programming, no matter which language you are working with. Failing often lets you see the progress you are making, clears away bugs that might otherwise trip you up later, and lets you test different approaches to running your code. When you fail, a teammate can help you understand the error, and they may have an easy solution. A team that fails early is not waiting until the last minute to get things done, which would produce bad work and cause a lot of headaches. As you learn more about programming languages, you will always run into problems you haven’t seen before. Even in the classes we have taken as a group at the University of Arizona, we noticed that professors who have been teaching and programming for many years sometimes struggle to diagnose errors and need time to find a solution or a better method. Group members who wait a while and don’t encounter errors early slow down the rest of the group, which makes the timeline very difficult to follow and may keep the team from getting the best results possible.

Share collaboration tools: Collaboration tools are critical to ensure everyone is on the right task and not repeating work or missing tasks. We are using the CyVerse Wiki to post results and work that is ready for public view, Google Docs for collaborating on written assignments, Slack for inter-team communication and progress updates, and GitHub as our central repository for files and data. Appointing a data manager at the beginning of the project is crucial to ensure files aren’t lost or overwritten. Getting everyone on the same applications was the first step in collaborating effectively.

Manage your data like the collaboration depends on it: Data management is critical to the success of any development team; maintaining file integrity and preventing data loss will make or break any effort in data science. Version control is usually the best solution to data management problems: it allows each individual to contribute to the project while ensuring that their effort does not compromise valuable data or overwrite another individual’s work. The collaboration effort is enabled by the data custodian, who can ensure the source files are not affected by every attempt to analyze the data. Once the source data is changed or lost by one person, it is lost by all, and progress is slowed or halted entirely. We prevented this by using GitHub version control for our code and by separating the source files from the output files.

Write code that others can use and reproduce: Reproducible code is the result of proper documentation and commenting; if others cannot understand the method or logic behind the code, the code becomes an artifact, unique and lost to time. By commenting one’s code, the logic becomes readable: even if the code becomes outdated, the method can be reimplemented in a different language or updated to run properly. Documentation is key because it fills in the knowledge gaps and provides contextual information about the problem and the code’s solution to it. Regarding the code itself, there is a principle in computer science known as generics: if the goal is to implement a data structure that sorts data a certain way, the code should be able to handle different types of information, integers as well as strings, even certain kinds of objects or structures. By applying this principle, code can be reused across platforms and languages. We accomplished this by documenting our methods on GitHub and ensuring that our code is open source.

Observe ethical hygiene: As mentioned on the website, keeping in mind how our techniques and practices could be used outside of our intended application helps ensure they are not used for malicious purposes. We carefully selected the right license to accompany our project so that there is transparency in how our data is used; transparency lets us keep an eye on illegitimate uses of our techniques. In addition to preventing malicious use, following ethical guidelines provides a framework for improving accountability, responsibility, and reproducibility within the community.

Document your collaboration: Journaling progress toward the project goal is an important habit that benefits learning and understanding after project completion. It allows other groups or individuals to see the difficulties the project presented and the steps taken to solve them, and perhaps make different choices or better understand the choices made. Luckily for us, Slack and GitHub track changes and commit messages, allowing for easy retrospective analysis of our thought processes. We opted for a more data-driven approach to journaling, in which commit messages and an after-action review are the primary means of documenting our difficulties and challenges.

Acknowledgements

We would like to thank Emmanuel Gonzalez, Eric Lyons, and Nirav Merchant for guidance and direction in Machine Learning, High-Performance Computing, and Data Analysis for this project.