Building a Simple Version Control System with Python

Version control is an essential tool in the world of software development, allowing developers to track and manage changes to code over time.

Today, we’re going to explore how to build a simplified version control system (VCS) using Python. This project is perfect for intermediate Python users looking to deepen their understanding of file manipulation, hashing, and data serialization.

Why Build a Version Control System?

While modern VCSs like Git and SVN are complex and feature-rich, building a basic VCS from scratch can provide invaluable insights into how these systems work under the hood.

Our project will focus on creating and reverting snapshots of a directory, simulating the core functionality of tracking changes over time.

Getting Started

Our VCS will use Python’s hashlib for hashing, os for file and directory operations, and pickle for serialization. Because unpickling can execute arbitrary code, only load snapshot files that the tool itself created. The system will operate through a simple command-line interface, supporting operations to initialize the VCS, create snapshots, and revert to a specific snapshot.
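
As a quick illustration of these building blocks, here is a minimal sketch that hashes some bytes incrementally with hashlib.sha256 and round-trips a dictionary through pickle. The sample data and variable names are made up for illustration and are not part of the project code.

```python
import hashlib
import pickle

# Hypothetical in-memory "files" standing in for real file contents
files = {"a.txt": b"hello", "b.txt": b"world"}

# update() can be called repeatedly; the digest covers all bytes fed so far
hasher = hashlib.sha256()
for content in files.values():
    hasher.update(content)
digest = hasher.hexdigest()

# Feeding the chunks one by one matches hashing their concatenation
assert digest == hashlib.sha256(b"helloworld").hexdigest()

# pickle turns the dictionary into bytes and back again
blob = pickle.dumps({"files": files})
restored = pickle.loads(blob)
assert restored["files"] == files
```

The incremental update() calls are what allow a single hash to cover many files without holding them all in one byte string.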

Initialization

First, we’ll import the standard-library modules we need and define an initialization function that creates a directory to hold our snapshots.

import os
import hashlib
import pickle

def init_vcs():
    os.makedirs('.vcs_storage', exist_ok=True)
    print("VCS initialized.")

The Snapshot Function

The snapshot function is where the magic starts. It walks through the directory specified, reads each file, and calculates a cumulative hash using SHA-256, a secure hashing algorithm.

Each file's path and content are stored in a dictionary, which is then serialized and saved to a file named after the snapshot's hash.

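This process ties each snapshot's identifier to the contents of the files it includes. Only file contents feed the hash, not file paths, so a rename on its own is not guaranteed to produce a new identifier.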

def snapshot(directory):
    # Initialize a SHA-256 hash object to compute the snapshot's unique hash
    snapshot_hash = hashlib.sha256()
    # Prepare a dictionary to hold snapshot data, including a sub-dictionary for file contents
    snapshot_data = {'files': {}}

    # Walk through the directory, capturing the directory tree and files
    for root, dirs, files in os.walk(directory):
        # Prune .vcs_storage in place so os.walk never descends into it
        dirs[:] = [d for d in dirs if d != '.vcs_storage']
        for file in files:
            # Construct the full path to the file
            file_path = os.path.join(root, file)
            # Open and read the file's content in binary mode
            with open(file_path, 'rb') as f:
                content = f.read()
                # Update the snapshot's hash with the file content
                snapshot_hash.update(content)
                # Store the file content in the snapshot data
                snapshot_data['files'][file_path] = content

    # Finalize the hash calculation for the snapshot
    hash_digest = snapshot_hash.hexdigest()
    # Save the list of files in the snapshot for later reference
    snapshot_data['file_list'] = list(snapshot_data['files'].keys())
    # Serialize and save the snapshot data to a file named after the snapshot's hash
    with open(f'.vcs_storage/{hash_digest}', 'wb') as f:
        pickle.dump(snapshot_data, f)

    # Print a confirmation with the unique hash of the created snapshot
    print(f"Snapshot created with hash {hash_digest}")
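
The function above feeds only file contents into the hash, in whatever order os.walk returns them. If you want the identifier to be reproducible and to reflect renames, one possible variant is sketched below. This is an assumption-laden illustration, not the article's function: tree_hash is a hypothetical name, and hashing sorted relative paths together with contents is a design choice swapped in here for demonstration.

```python
import hashlib
import os
import tempfile

def tree_hash(directory):
    """Hypothetical helper: hash sorted relative paths plus file contents."""
    hasher = hashlib.sha256()
    entries = []
    for root, dirs, files in os.walk(directory):
        # Prune the storage directory so stored snapshots are never hashed
        dirs[:] = [d for d in dirs if d != '.vcs_storage']
        for name in files:
            entries.append(os.path.join(root, name))
    # Sort so the result does not depend on filesystem listing order
    for path in sorted(entries):
        rel = os.path.relpath(path, directory)
        hasher.update(rel.encode('utf-8'))
        # Separator keeps the path bytes distinct from the content bytes
        hasher.update(b'\0')
        with open(path, 'rb') as f:
            hasher.update(f.read())
    return hasher.hexdigest()

# Usage: rename a file in a throwaway directory and compare the two hashes
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'a.txt'), 'wb') as f:
        f.write(b'hello')
    before = tree_hash(tmp)
    os.rename(os.path.join(tmp, 'a.txt'), os.path.join(tmp, 'b.txt'))
    after = tree_hash(tmp)

print(before != after)
```

Here the rename changes the hash even though the file's bytes are identical, because the relative path is part of the hashed input.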

The Revert Function

To revert to a snapshot, we load the serialized snapshot data, restore each file’s content, and remove any files in the current working directory that are not present in the snapshot. This brings the files back to the state they were in at the time of the snapshot; directories left empty by the cleanup are not removed.

def revert_to_snapshot(hash_digest):
    # Construct the path to the snapshot file based on its hash
    snapshot_path = f'.vcs_storage/{hash_digest}'
    # Check if the snapshot exists; if not, print a message and exit the function
    if not os.path.exists(snapshot_path):
        print("Snapshot does not exist.")
        return

    # Load the snapshot data from the file
    with open(snapshot_path, 'rb') as f:
        snapshot_data = pickle.load(f)

    # Iterate over each file in the snapshot
    for file_path, content in snapshot_data['files'].items():
        # Ensure the directory for the file exists, creating it if necessary
        os.makedirs(os.path.dirname(file_path), exist_ok=True)
        # Write the file content back, restoring it to its snapshot state
        with open(file_path, 'wb') as f:
            f.write(content)

    # Prepare to identify and delete files not part of the snapshot
    current_files = set()
    # Walk through the current directory structure
    for root, dirs, files in os.walk('.', topdown=True):
        # Prune .vcs_storage in place so stored snapshots are never touched
        dirs[:] = [d for d in dirs if d != '.vcs_storage']
        # Add each file to the set of current files
        for file in files:
            current_files.add(os.path.join(root, file))

    # Create a set of files that were part of the snapshot
    snapshot_files = set(snapshot_data['file_list'])
    # Determine which files currently exist but weren't in the snapshot
    files_to_delete = current_files - snapshot_files

    # Delete each file that should not exist based on the snapshot
    for file_path in files_to_delete:
        os.remove(file_path)
        # Print a message for each file removed
        print(f"Removed {file_path}")

    # Print a confirmation of reverting to the specified snapshot
    print(f"Reverted to snapshot {hash_digest}")
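
The cleanup step in the revert function boils down to a set difference between the files on disk and the files recorded in the snapshot. The sketch below isolates that step as a dry run that only reports what would be deleted. The name files_to_remove and the temporary directory are hypothetical and exist only for illustration.

```python
import os
import tempfile

def files_to_remove(directory, snapshot_files):
    """Hypothetical helper: list files under directory that the snapshot lacks."""
    current = set()
    for root, dirs, files in os.walk(directory):
        # Never consider the snapshot storage itself for deletion
        dirs[:] = [d for d in dirs if d != '.vcs_storage']
        for name in files:
            current.add(os.path.join(root, name))
    # Anything on disk that the snapshot does not list is a deletion candidate
    return sorted(current - set(snapshot_files))

# Usage: two files on disk, only one of them recorded in the "snapshot"
with tempfile.TemporaryDirectory() as tmp:
    kept = os.path.join(tmp, 'kept.txt')
    extra = os.path.join(tmp, 'extra.txt')
    for path in (kept, extra):
        with open(path, 'w') as f:
            f.write('data')
    pending = files_to_remove(tmp, [kept])
    only_extra = pending == [extra]

print(only_extra)
```

The comparison is purely string-based, so it only works when the snapshot paths and the walked paths have the same form. In the article's code both start from '.', which is why a revert should be run from the same directory in which the snapshot was taken.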

Putting It All Together

Next, we’ll add a command-line entry point that ties everything together.

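if __name__ == "__main__":
    import sys

    # Require a command; exit with a usage message instead of an IndexError
    if len(sys.argv) < 2:
        sys.exit("Usage: python vcs.py <init|snapshot|revert> [hash]")
    command = sys.argv[1]

    if command == "init":
        init_vcs()
    elif command == "snapshot":
        snapshot('.')
    elif command == "revert":
        # The revert command needs the snapshot hash as a second argument
        if len(sys.argv) < 3:
            sys.exit("Usage: python vcs.py revert <hash>")
        revert_to_snapshot(sys.argv[2])
    else:
        print("Unknown command.")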

To test, save the script as vcs.py and run it from the command line: python vcs.py init to initialize, python vcs.py snapshot to create a snapshot, and python vcs.py revert <hash> to revert.
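
If you would like argument validation and help text without hand-written checks, the same three commands could also be described with the standard library's argparse module. The following is only a sketch of that idea: build_parser is a hypothetical helper, and wiring the parsed result to init_vcs, snapshot, and revert_to_snapshot is left out.

```python
import argparse

def build_parser():
    """Hypothetical helper: an argparse version of the init/snapshot/revert CLI."""
    parser = argparse.ArgumentParser(prog='vcs.py', description='Simple VCS')
    # required=True makes argparse reject an invocation without a command
    subparsers = parser.add_subparsers(dest='command', required=True)
    subparsers.add_parser('init', help='create the snapshot storage directory')
    subparsers.add_parser('snapshot', help='snapshot the current directory')
    revert = subparsers.add_parser('revert', help='restore a snapshot')
    revert.add_argument('hash', help='hash printed when the snapshot was created')
    return parser

# Parse a sample argument list instead of sys.argv to show the result
args = build_parser().parse_args(['revert', 'abc123'])
print(args.command, args.hash)
```

With this approach, a missing command or a missing hash produces a usage message from argparse itself.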

And there you have it, a simple but functional version control system in Python! While it’s a basic implementation, it lays the groundwork for understanding more complex systems like Git.

Experiment with it, and consider adding more features, like diff viewing or branch management. Thanks for reading, and happy coding!

Resources

GitHub Repo: https://github.com/musicalchemist/simple_vcs


NOTE: This article was originally written and published in 2024 on my previous personal blog and on Medium.