Software development is changing. Mastering pull requests in Git and GitHub is like learning diplomacy in coding. This article aims to explain the details of this art, with a guide for both the raiser and the reviewer. At its core, the process is a delicate balance: proposing improvements in a positive way while protecting the codebase’s integrity. The process embodies a team spirit that fuels innovation. The idea is to foster an environment where quality, speed, and mutual growth are key. This helps ensure that each pull request is not only a submission but a step towards excellence in data science and software.
Why are pull requests important?
How skilled you should be at developing software is a hot topic for Data Scientists. In an ideal world, Data Science sits at the intersection of software development, machine learning, and domain expertise (look at the fancy Venn Diagram). As always, reality is a bit more complex than it looks.
In practice, many of us data scientists (yes, I am guilty too) are not good at software development. We lack the basics of what good software should look like.
Whenever we work on an analysis or a model on our own, we can indulge in adding and sharing all our work on the main branch. But, as soon as we start working in a team, this approach becomes inefficient or even harmful.
When colleagues work on the same project, they usually end up editing the same scripts. Luckily, we can use a powerful tool that git and GitHub offer: pull requests. They create order in what would otherwise become a spaghetti code club.
But first, let’s look at the details of how a pull request works and why it is important for a Data Scientist in a team. If you don’t know what a pull request is, or need a git refresher for Data Science teamwork, see my article: Data Science teamwork made easy with git: A step-by-step guide.
Best practices for pull request raisers
Start with a clear goal
A clear and concise goal for each pull request is key to fruitful collaboration among team members, including data scientists. A well-defined goal streamlines the review process, sets clear expectations, and helps contributors understand the context and purpose of the proposed changes. This clarity is especially important for code changes that involve complex data models or analytical algorithms, which need careful examination to ensure accuracy and efficiency.
By setting a clear objective, team members can focus their efforts on achieving a specific outcome, thereby reducing the risk of scope creep or miscommunication. Additionally, it encourages meaningful feedback and discussions that can lead to innovative solutions and improvements. Maintaining this level of clarity and purpose in pull requests is a fundamental practice that supports the continuous integration and delivery pipeline, thereby enhancing the overall productivity and quality of the collaborative development process.
Include all the relevant information
Sometimes we are lazy (again, I’m guilty, your honor). We want to finish a script we’ve worked on for weeks, so we create a pull request that contains a title and nothing more. This is bad: the reviewer cannot understand the context of a change or its logic.
The best way to ensure that we have all the needed information is to use a template for each pull request. The template should contain the following points:
- context: describe the changes made in simple terms
- fully functioning code: does the code run without errors?
- unit tests: did you test the code before raising the PR?
- seek feedback: include any specific area where feedback is sought
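To make the template harder to skip, a small script can check that a PR description contains each section. The sketch below is only an illustration: the section names in `REQUIRED_SECTIONS` and the function `missing_sections` are hypothetical, chosen to mirror the four points above, and it assumes the description uses markdown headings.

```python
import re

# Hypothetical section names, mirroring the four template points above.
REQUIRED_SECTIONS = ("Context", "Functioning code", "Unit tests", "Feedback wanted")


def missing_sections(description: str, required=REQUIRED_SECTIONS) -> list[str]:
    """Return the required section headings that do not appear in the description."""
    # Collect every markdown heading (for example "## Context"), lower-cased.
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#{1,6}[ \t]+(.+?)\s*$", description, flags=re.MULTILINE)
    }
    return [name for name in required if name.lower() not in headings]


example = """## Context
Replace the hand-written scaler with a reusable function.

## Unit tests
Added tests for empty and constant columns.
"""

print(missing_sections(example))  # ['Functioning code', 'Feedback wanted']
```

A check like this could run before a PR is raised, but it only verifies that the sections exist, not that they say anything useful. That part is still on the raiser.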
Make smaller, incremental changes
I once had to review a PR where 84 files were changed. Whenever you end up raising such a big PR, consider how hard it is to review. This is especially true for someone who lacks the context or didn’t work on the same project.
Reviewing something this big requires a considerable amount of time. Instead, we can take a more agile approach and define the maximum number of files that one can change in a single PR. This helps the review process, enables faster integration, and reduces the risk of big errors.
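One way to keep an eye on PR size is to count the changed files before raising it. The following is a minimal sketch, assuming the team’s target branch is called `main` and a limit of 20 files. Both values, and the helper names `changed_files` and `is_reviewable`, are made up for illustration.

```python
import subprocess

MAX_CHANGED_FILES = 20  # hypothetical team limit; agree on your own number


def changed_files(base: str = "main") -> list[str]:
    """List files that differ between the merge base with `base` and HEAD."""
    # `git diff --name-only base...HEAD` prints one changed path per line.
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def is_reviewable(files: list[str], limit: int = MAX_CHANGED_FILES) -> bool:
    """Return True when the number of changed files is within the limit."""
    return len(files) <= limit


# The 84-file PR from the anecdote above would not pass this check.
print(is_reviewable(["script.py"] * 84))  # False
```

From inside a feature branch, `is_reviewable(changed_files())` gives a quick answer before the PR is opened. The number of files is a rough proxy: a one-line change across many files can be easier to review than a rewrite of a single script, so treat the limit as a prompt to think about splitting rather than a hard rule.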
Use naming and organisation conventions
Following consistent naming conventions throughout the project and company is key. It’s important that everyone aligns on code standards. This practice keeps code changes readable and understandable.
Some examples of Python code styles used in big tech companies:
Include Tests and Documentation
Ensuring the code runs without errors is crucial, but it must also do what’s intended. A good pull request should contain tested code and updated documentation. This ensures no surprises: the tests cover issues that might arise when the code runs.
Best Practices for Reviewers in Data Science
Approach with a Constructive Mindset
Constructive criticism delivered with a positive mindset can enhance the collaborative process. When a reviewer takes this view, they enable a more open and productive environment. Focus on suggesting improvements rather than pointing out flaws. This not only motivates the raiser but also fosters a sense of partnership. This philosophy also ensures continuous learning for both the reviewer and the raiser. Each interaction becomes an opportunity for their growth.
Understand the Context
Before reviewing code, you must understand the pull request’s context. This understanding forms the foundation for constructive feedback that aligns with project objectives. Reviewers should know the pull request’s goal. What issues does it address? How does it contribute to the project’s goals? Does it add features or improvements? How do they fit the current development cycle?
Taking time to understand the context goes beyond reading the code changes. Look at the related docs. Also, check the commit messages and any linked tickets or user stories. They can show the developer’s goals and the expected outcomes of the pull request.
Asking clarifying questions
If the context or goal of the pull request is not clear from the provided information, it’s essential to seek clarity.
- Don’t hesitate to ask questions. If any aspect of the code or its purpose is not clear, ask for clarification. This can cover technical details, the reasons for specific choices, and how the changes affect the codebase.
- Use comments. Most version control platforms let you comment on pull requests. Use this feature to ask pointed questions and to request explanations of complex parts of the code.
- Engage in a dialogue. Create a place where the raiser can explain without feeling defensive. Frame questions to help understanding.
Reviewers help by encouraging clear communication and by making sure they understand the context of the changes. This ensures the review is thorough, fast, and leads to useful changes.
Focus on the big picture
PR reviewers should focus on the big picture: how the changes fit into the project and how they affect existing functionality. It is key not to get bogged down in minor style issues, which have little effect on the project’s goals. Following a style guide is important, but within a project we should welcome small personal differences in coding style.
Provide Clear, Actionable Feedback
The reviewer should also focus on giving clear feedback. The feedback should be something the raiser can use to improve the pull request. Using specific examples or suggesting alternative implementations helps clarify the feedback shared.
Foster a Culture of Learning and Improvement
The review process plays a key role in a Data Science team: it fosters a culture of learning and improvement. The reviewer should always look for ways to encourage knowledge sharing during the review.
Example
I am a fan of explaining with examples. So, here’s an example of a well-crafted PR in a simple Python project. It adds a new feature: a function to calculate factorials. This example will follow the best practices outlined in the article.
Title of the Pull Request
Add factorial function to math utilities
Description
## Overview
This pull request introduces a new function, `calculate_factorial`, to our collection of mathematical utility functions. The addition of this function aims to extend our utility module's capabilities, allowing it to support factorial calculations which are frequently required in combinatorial mathematics and algorithmic challenges.
## Changes Made
- Implemented `calculate_factorial` in `math_utils.py` which takes an integer input and returns its factorial.
- Added unit tests in `test_math_utils.py` to ensure the correctness of the factorial function across a range of inputs, including edge cases like 0 and 1.
## How to Test
1. Pull this branch into your local environment.
2. Run the unit tests using the command: `python -m unittest discover -s tests`.
3. All tests should pass, verifying the correct implementation of the factorial function.
## Related Issue
This PR addresses the feature request outlined in Issue #1234.
## Screenshots/Output Snippets
For a quick verification, here’s an example output of the function when input is 5:
>>> from math_utils import calculate_factorial
>>> calculate_factorial(5)
120
As always, I appreciate your feedback and am open to any suggestions or changes you think might improve this implementation. Thank you for considering this addition to our project!
Key Points in This Pull Request Example:
- Title: The title is concise, clearly indicating what the PR aims to add without requiring the reader to dive into the details immediately.
- Overview: Provides a summary of what this PR is about and the rationale behind it.
- Changes Made: Lists the specific changes made in this PR, enhancing transparency and making it easier for reviewers to understand the scope of the PR.
- How to Test: Includes clear testing steps, allowing reviewers to easily verify the functionality added or changed.
- Related Issue: Links the PR to any related issue(s), providing context and ensuring it’s easy to track the development process.
- Screenshots/Output Snippets: Offers a quick way to see the result of the change, aiding in the review process.
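The PR description above does not show the code itself, so here is a minimal sketch of what the change might contain. It is an assumption, not the actual implementation: in the PR the function would live in `math_utils.py` and the tests in `test_math_utils.py`, while here both sit in one block to keep the example self-contained. The handling of negative and non-integer input is also my own choice for illustration.

```python
import unittest


def calculate_factorial(n: int) -> int:
    """Return n! for a non-negative integer n."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(n, bool) or not isinstance(n, int):
        raise TypeError("n must be an integer")
    if n < 0:
        raise ValueError("n must be non-negative")
    result = 1
    # Multiplying from 2 upwards leaves result at 1 for n = 0 and n = 1.
    for factor in range(2, n + 1):
        result *= factor
    return result


class TestCalculateFactorial(unittest.TestCase):
    def test_edge_cases(self):
        # 0! and 1! are both defined as 1.
        self.assertEqual(calculate_factorial(0), 1)
        self.assertEqual(calculate_factorial(1), 1)

    def test_typical_value(self):
        self.assertEqual(calculate_factorial(5), 120)

    def test_negative_input_is_rejected(self):
        with self.assertRaises(ValueError):
            calculate_factorial(-1)
```

With the tests in a `tests` folder, the command from the "How to Test" section would pick them up. Note how the test names match the claims in the description (edge cases like 0 and 1), which makes it easy for a reviewer to check that the PR does what it says.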
Conclusion: Harnessing the Collective Wisdom
- Pull requests are a catalyst for collaboration and knowledge sharing within Data Science teams. Following best practices can turn these requests into chances to grow and improve.
- Improving how we handle pull requests makes the codebase strong and efficient. It also shows our expertise and keeps our projects at the cutting edge, driven by every contributor’s diverse insights and careful work.
- Embracing both raisers’ and reviewers’ views in pull requests enriches our understanding. It helps us appreciate software development as a collaborative journey, leading to better, more resilient, and innovative solutions.