
Data Science Teamwork Made Easy with Git: A Step-by-Step Guide

In this article, I will walk through the steps needed to work successfully as a team on a shared project using Git.

Introduction

Many data scientists tend to work on individual projects to develop their skills in managing tasks independently. However, to succeed in the field, it is crucial to learn how to collaborate effectively on larger projects. This includes the ability to share, modify and merge code with colleagues.

One of the most widely used tools for code collaboration is Git, a distributed version control system that allows multiple people to work on a project at the same time, keeping track of the changes without overwriting each other’s code.

At first glance, however, Git can appear complex and intimidating, especially to those who are new to it.

Drawing from my personal experience, I’ve found that the most effective way to familiarize oneself with a new tool is to use it in a project. So, whether you have a side project you’re eager to collaborate on with a friend, or a new work stream that demands your attention at work, now is the perfect time to incorporate Git into your toolkit.

This blog post aims to explain the process of using Git in a team setting. It provides a comprehensive step-by-step guide and a practical checklist for managing a GitHub repository as part of a team. My goal is to help you master the basics of Git with ease, enhancing your collaborative efforts in data science.

Key Git Commands

Let’s start by getting familiar with the most important git commands you will use when collaborating with other data scientists, and with why each of them is useful. Here is a list of the basic commands you need to understand and use when working with git repos:

  1. git clone <repository-url>: This command creates a copy of a remote repository on your local machine. Let’s say you want to clone this GitHub repo, which tackles the famous Titanic survival analysis problem. By running git clone https://github.com/feddernico/titanic-survival-analysis.git you will get a full local copy of the GitHub repo, including its history.
  2. git pull: Before making any change to a repo, it’s good practice to make sure you have the latest version. This command fetches the latest commits from the remote repository and integrates them into the branch you currently have checked out.
  3. git branch: This command lists the local branches of a git repo and marks the one you are currently on (add the -a flag to list remote branches as well). A branch is a common way to divide the code into logical work streams. The main branch is usually the one that contains the stable version of your code.
  4. git checkout <branch-name>: If you want to make changes to the content of a repo, it’s good practice to create a branch off main and give it a meaningful name. The checkout command switches to the specified branch. To create a new branch called cool-new-feature and switch to it in one step, add the -b flag: git checkout -b cool-new-feature.
  5. git add <file>: Suppose you made some changes to the new_feature.py file. This command stages a new file, or changes to an existing file, for commit. Staging a file means marking its current changes for inclusion in the next commit. So, in our case, we should write something like this:
    git add new_feature.py
  6. git commit -m "<message>": This command saves your staged changes to the local repository. A commit groups the changes made to your repo into a logical step. The -m option lets you attach a meaningful message to the commit. So let’s write something like:
    git commit -m "added a cool new feature to the repo"
  7. git push origin <branch-name>: This command sends your committed changes to the remote repository. So let’s send the changes made on our cool-new-feature branch locally to our remote repository. To do so we have to type:
    git push origin cool-new-feature
  8. Now that we have tested our new feature, we want to make sure it is included in the main branch of our repo, the one that contains the stable version of our project. To do so in GitHub, we need to visit the pull requests page of our repo.
    By clicking the green New pull request button on the right-hand side of the page, we can compare the main and cool-new-feature branches of the repo and merge them, so that the changes made in the new branch are included in main.
    During this last step, it is usually good to have a peer review what has been developed, in order to identify any potential issues or bugs in the code. This phase is called Quality Assurance, or QA.

That completes one development cycle: a new feature has been implemented and merged into an ongoing project hosted on a repo.
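
If you would like to try commands 1 to 7 end to end without touching a real project, here is a minimal sketch. It rests on a few assumptions of mine that are not part of the steps above: a throwaway folder created with mktemp, a local bare repository standing in for GitHub, and a placeholder user name and email. The pull request step is left out because it happens in the GitHub web interface.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: a throwaway sandbox; a local bare repo plays the role of GitHub.
SANDBOX="$(mktemp -d)"

# Seed the stand-in remote with a single commit on main.
git init --quiet "$SANDBOX/seed"
cd "$SANDBOX/seed"
git symbolic-ref HEAD refs/heads/main          # make sure the first branch is called main
git config user.name "Example User"            # placeholder identity
git config user.email "user@example.com"
echo "# Titanic survival analysis (sandbox copy)" > README.md
git add README.md
git commit --quiet -m "initial commit"
git clone --quiet --bare "$SANDBOX/seed" "$SANDBOX/remote.git"

# 1. Clone the (stand-in) remote repository.
git clone --quiet "$SANDBOX/remote.git" "$SANDBOX/work"
cd "$SANDBOX/work"
git config user.name "Example User"
git config user.email "user@example.com"

# 2. Make sure the local copy is up to date.
git pull --quiet

# 3. List the branches (--no-pager keeps the output inline in a script).
git --no-pager branch

# 4. Create a feature branch and switch to it.
git checkout --quiet -b cool-new-feature

# 5. Change something and stage it.
echo 'print("a cool new feature")' > new_feature.py
git add new_feature.py

# 6. Commit the staged change with a message.
git commit --quiet -m "added a cool new feature to the repo"

# 7. Send the branch to the remote.
git push --quiet origin cool-new-feature
```

After the script finishes, the sandbox remote holds both main and cool-new-feature, which is the state a real repo would be in just before you open a pull request.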

Managing Projects with Multiple People

At work, you will usually share a project with several colleagues, so it’s essential to establish a set of rules that everyone can follow. Here’s a simple workflow that can work well for small teams:

  1. Clone the Repository: Each team member should start by cloning the repository to their local machine.
  2. Create a Branch: Before making any changes, each team member should create a new branch. This allows everyone to work on their own version of the project without affecting the main codebase.
  3. Make Changes and Commit: After making changes, each team member should stage and commit their changes. Remember to write clear, concise commit messages that explain what changes were made and why.
  4. Push Changes to the Remote Repository: Once changes have been committed, they can be pushed to the remote repository.
  5. Create a Pull Request: After pushing changes, team members should create a pull request. This allows others to review the changes before they are merged into the main codebase.
  6. Review and Merge Pull Requests: Finally, the team should review each pull request and, if everything looks good, merge the changes into the main codebase.
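
On GitHub, steps 5 and 6 happen in the browser. When a reviewer prefers to inspect a colleague’s branch locally first, the review can be sketched on the command line as below. This is an illustration under my own assumptions rather than part of the workflow itself: the sandbox folder, the bare repo standing in for GitHub, the author and reviewer folder names, and the final local merge (which takes the place of the merge button) are all hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: a throwaway sandbox; a local bare repo plays the role of GitHub.
SANDBOX="$(mktemp -d)"
git init --quiet "$SANDBOX/seed"
cd "$SANDBOX/seed"
git symbolic-ref HEAD refs/heads/main
git config user.name "Seed"
git config user.email "seed@example.com"
echo "# shared project" > README.md
git add README.md
git commit --quiet -m "initial commit"
git clone --quiet --bare "$SANDBOX/seed" "$SANDBOX/remote.git"

# Step 1: both teammates clone the repository.
git clone --quiet "$SANDBOX/remote.git" "$SANDBOX/author"
git clone --quiet "$SANDBOX/remote.git" "$SANDBOX/reviewer"

# Steps 2 to 4 (author): branch, commit, push.
cd "$SANDBOX/author"
git config user.name "Author"
git config user.email "author@example.com"
git checkout --quiet -b cool-new-feature
echo 'print("feature")' > new_feature.py
git add new_feature.py
git commit --quiet -m "add new_feature.py"
git push --quiet origin cool-new-feature

# Steps 5 and 6 (reviewer): fetch the branch and look at what it changes.
cd "$SANDBOX/reviewer"
git config user.name "Reviewer"
git config user.email "reviewer@example.com"
git fetch --quiet origin
git --no-pager log --oneline main..origin/cool-new-feature      # commits not yet in main
git --no-pager diff --stat main...origin/cool-new-feature       # files touched by the branch

# If everything looks good, merge into main and publish the result.
git merge --quiet --no-ff -m "merge cool-new-feature into main" origin/cool-new-feature
git push --quiet origin main
```

The two-dot range in git log lists the commits that exist on the feature branch but not on main, while the three-dot form in git diff compares the branch against the point where it split from main, which is close to what a pull request page shows.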

Resolving Conflicts

Conflicts can occur when two or more team members modify the same part of a file. Git is usually able to merge changes automatically, but when it can’t, you’ll need to resolve the conflicts manually. Here’s how:

  1. Identify the Conflict: Git will tell you which files have conflicts that need to be resolved.
  2. Open the File: Open the conflicted file in a text editor. You’ll see markers (lines starting with <<<<<<<, ======= and >>>>>>>) that indicate where the conflicts are.
  3. Resolve the Conflict: Decide which changes to keep, edit the file accordingly, and remove the conflict markers.
  4. Commit and Push the Resolved Conflict: After resolving the conflict, stage and commit the changes, then push them to the remote repository.
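
To see these four steps in action, the sketch below deliberately creates a conflict in a throwaway repository and then resolves it. The file name config.py, the branch name tune-split and the values in the file are made up for illustration, and because the sandbox has no remote, the final push is omitted.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: a throwaway local repo with a placeholder identity.
REPO="$(mktemp -d)"
cd "$REPO"
git init --quiet
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# A shared starting point on main.
echo "test_size = 0.2" > config.py
git add config.py
git commit --quiet -m "add config"

# One teammate edits the line on a branch...
git checkout --quiet -b tune-split
echo "test_size = 0.3" > config.py
git commit --quiet -am "use a 30% test split"

# ...while another edits the same line on main.
git checkout --quiet main
echo "test_size = 0.25" > config.py
git commit --quiet -am "use a 25% test split"

# The merge stops with a non-zero exit code, so guard it under `set -e`.
git merge tune-split || echo "merge stopped: conflicts need resolving"

# 1. Identify the conflict: list the files that are still unmerged.
git --no-pager diff --name-only --diff-filter=U

# 2. Open the file: the conflict markers are visible in its content.
cat config.py

# 3. Resolve the conflict: here we keep the value from tune-split
#    by writing the final content without any markers.
echo "test_size = 0.3" > config.py

# 4. Stage and commit the resolution (this concludes the merge).
git add config.py
git commit --quiet -m "resolve conflict in config.py"
```

In a real project you would edit the file by hand in step 3 instead of overwriting it, and finish with a git push to share the resolved merge.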

Checklist for Collaborative Projects

Finally, let’s have a look at a list of things to remember when working on a collaborative project.

  • [ ] Clone the repository to your local machine.
  • [ ] Create a new branch before making changes.
  • [ ] Stage and commit changes regularly with clear, concise messages.
  • [ ] Push changes to the remote repository.
  • [ ] Create a pull request for your changes.
  • [ ] Review pull requests from other team members.
  • [ ] Resolve conflicts when necessary, commit, and push the resolved file.
  • [ ] Merge pull requests into the main codebase after review.
  • [ ] Keep communication clear and consistent throughout the process.

Conclusion

Working with Git in a team setting can be a challenge, but with the right workflow and an understanding of the key commands, it becomes a powerful tool for collaboration. Remember, the goal is not only to manage code but also to facilitate clear and effective communication among team members. Happy coding!

