GoogleDriveBackup.Core
(License: This article, along with any associated source code, is licensed under The LGPL 2.1)
In today's digital world, cloud storage solutions like Google Drive offer incredible convenience. However, relying solely on them for critical data isn't foolproof. Data loss can still occur due to accidental user actions, account issues, malicious attacks, or simply the need for an offline, independently verifiable copy. Existing backup tools often didn't meet my specific requirements for format, efficiency, and resilience.
This led to the development of a custom Google Drive backup tool using C# and .NET. This multi-part article details its creation. Part 1 focuses specifically on the heart of the application: the GoogleDriveBackup.Core library. We'll explore the motivations behind its design, its internal architecture, and the solutions implemented to handle challenges like incremental backups, Google Workspace file exports, parallel processing, and archive reliability – all essential components for building a robust foundation.
A fundamental design decision was to strictly separate the core backup/restore logic from the user interface. All the heavy lifting resides within the GoogleDriveBackup.Core class library. This approach offers significant advantages:
Maintainability: Changes to core functionality (e.g., adapting to Google API updates, improving the incremental algorithm) are contained within the library, minimizing impact on any user interface code. Similarly, UI enhancements don't risk breaking the essential backup mechanisms.
Testability: The Core library can be subjected to automated unit and integration tests independently of any UI layer. This allows for rigorous verification of the critical logic, ensuring reliability.
Reusability & Future UIs: This is perhaps the most compelling reason. With all the complex logic encapsulated in GoogleDriveBackup.Core, developing alternative front-ends is vastly simplified. Want a graphical user interface (GUI) using WinForms, WPF, or MAUI? Need a web-based dashboard or a background service? These projects can simply reference the Core library and interact with its public API (BackupManager, RestoreManager, etc.). There's no need to rewrite the intricate communication with Google Drive, the manifest handling, the parallelism logic, or the repair mechanisms. Developers can focus on the UI/UX, knowing the underlying engine is already built and tested.
The library is structured around several key components, each with distinct responsibilities:
Manager Classes:
GoogleDriveService: Handles the OAuth 2.0 authentication flow using Google.Apis.Auth.OAuth2, with FileDataStore persisting the resulting tokens between runs. Provides the initialized DriveService object needed by other managers.
SettingsManager: Responsible for loading app_settings.json, saving it, and managing application configuration via the AppSettings class. Handles defaults and the loading/saving of settings profiles. Command-line overrides are supported, but it is the UI layer that applies them.
BackupManager: The orchestrator for creating backups. It lists files/folders from Google Drive, compares them against a previous backup's manifest for incremental logic, handles downloading standard files and exporting Google Workspace files, manages parallel downloads, and constructs the final ZIP archive containing the flat file structure and the _manifest.json.
RestoreManager: Manages restoring data from a backup archive to Google Drive. It extracts the ZIP, reads the manifest, recreates the original folder structure on Drive, uploads files (potentially in parallel), and crucially, implements resumable restores using a _restore_state.json file.
RepairManager: Provides functionality to attempt repair of a potentially damaged backup archive. It extracts the archive, validates its contents against the manifest, identifies missing files based on their expected ArchivePath (which contains the File ID), re-downloads only the missing files from Google Drive using their IDs, and generates a new, hopefully complete, archive.
StatusManager: A simpler, potentially legacy, manager for handling a basic backup_status.json file, primarily tracking the last successful backup timestamp.
Configuration Class:
AppSettings: A Plain Old CLR Object (POCO) representing the structure of app_settings.json. Defines properties for Drive IDs, local paths, backup cycle, exclusions, verbosity, and the MaxParallelTasks setting. Includes constants for defaults and helper methods.
Key Data Structures:
_manifest.json: A critical JSON file embedded within each backup ZIP. It contains a list of FileManifestEntry objects, each mapping a file's original Google Drive path, modification time, and size to its unique ArchivePath (typically FileID.extension) within the flat ZIP structure.
_restore_state.json: A JSON file created temporarily during the restore process. It stores the AppSettings used for that specific restore operation and a list of CompletedArchivePaths (the FileID.extension names of files already successfully uploaded), enabling the restore to be resumed if interrupted.
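As a rough illustration of these two structures, the sketch below models a manifest entry and the restore state as plain classes and round-trips the manifest through System.Text.Json. Only ArchivePath and CompletedArchivePaths are names taken from the description above; the other property names, the ManifestJson helper, and the choice of serializer are assumptions made for the example, and the AppSettings snapshot that the real state file also carries is left out for brevity.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical shape of one _manifest.json entry.
public sealed class FileManifestEntry
{
    public string DrivePath { get; set; } = "";     // original path on Drive (assumed name)
    public DateTime ModifiedTimeUtc { get; set; }   // Drive modifiedTime (assumed name)
    public long? Size { get; set; }                 // assumed nullable: size may be unknown
    public string ArchivePath { get; set; } = "";   // "<FileID>.<extension>" in the flat ZIP
}

// Hypothetical shape of _restore_state.json (settings snapshot omitted).
public sealed class RestoreState
{
    public List<string> CompletedArchivePaths { get; set; } = new();
}

public static class ManifestJson
{
    private static readonly JsonSerializerOptions Options = new() { WriteIndented = true };

    public static string Serialize(List<FileManifestEntry> entries) =>
        JsonSerializer.Serialize(entries, Options);

    // Returns an empty list if the JSON document is the literal "null".
    public static List<FileManifestEntry> Deserialize(string json) =>
        JsonSerializer.Deserialize<List<FileManifestEntry>>(json, Options) ?? new();
}
```

Because the manifest is the only place where the original Drive path is recorded, losing it would leave a ZIP full of ID-named files with no way to rebuild the folder tree.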
Building the core library involved tackling several technical hurdles:
1. Authentication (GoogleDriveService.cs)
The implementation uses the standard Google OAuth 2.0 flow for installed applications provided by the Google.Apis.Auth library.
It requires a client_secrets.json file obtained from your Google Cloud Console project.
GoogleWebAuthorizationBroker.AuthorizeAsync simplifies the process by handling the browser-based user consent prompt.
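FileDataStore persists the obtained tokens, including the refresh token, as files in the user's application data folder, avoiding repeated authorization requests on subsequent runs. The library does not encrypt these files, so the folder should be protected like any other credential store.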
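// Simplified sketch in the spirit of GoogleDriveService.AuthenticateAsync.
// Needs: Google.Apis.Auth.OAuth2, Google.Apis.Drive.v3, Google.Apis.Services, Google.Apis.Util.Store
// Assumptions (not taken from the tool): the Drive scope, the "user" key, the store folder name.
using var stream = new FileStream("client_secrets.json", FileMode.Open, FileAccess.Read);
UserCredential credential = await GoogleWebAuthorizationBroker.AuthorizeAsync(
    GoogleClientSecrets.FromStream(stream).Secrets,
    new[] { DriveService.Scope.Drive },
    "user",
    cancellationToken,
    new FileDataStore("GoogleDriveBackup.Auth"));
var service = new DriveService(new BaseClientService.Initializer
{
    HttpClientInitializer = credential,
    ApplicationName = "Google Drive ZIP Backup Tool", // Replace with your app name
});
// This 'service' object is then used by other managers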
2. Reliable File Listing (BackupManager.ListAllFilesAndFoldersAsync)
Efficiently and reliably listing potentially thousands of files requires careful API usage:
A breadth-first traversal using a Queue<(string Id, string RelativePath)> is employed to process folders level by level.
The Files.List API method is used with the Q parameter filtering for children of the current folder ('{folderId}' in parents and trashed=false).
API pagination (nextPageToken) is essential to retrieve all results for folders containing many items.
Requested fields (fields=nextPageToken, files(id, name, mimeType, size, modifiedTime)) are limited to only what's necessary for the backup process.
Exclusion rules from AppSettings are applied during traversal to skip specified relative paths and their contents.
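The traversal described above can be sketched without a dependency on the Google client library by hiding the Files.List call behind a delegate. In this sketch, DriveItem, ChildPage, DriveLister, and the fetchPage parameter are hypothetical names introduced for illustration. In the real tool the delegate's role would be played by a Files.List request with the Q, Fields, and PageToken values mentioned in the list.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public sealed record DriveItem(string Id, string Name, string MimeType);

// One page of a folder's children plus the token for the next page (null when done).
public sealed record ChildPage(IReadOnlyList<DriveItem> Items, string? NextPageToken);

public static class DriveLister
{
    public const string FolderMimeType = "application/vnd.google-apps.folder";

    // fetchPage(folderId, pageToken) stands in for a Files.List call with
    // Q = "'{folderId}' in parents and trashed=false" and PageToken = pageToken.
    public static async Task<List<(DriveItem Item, string RelativePath)>> ListAllAsync(
        string rootFolderId,
        Func<string, string?, Task<ChildPage>> fetchPage,
        ISet<string> excludedPaths)
    {
        var results = new List<(DriveItem Item, string RelativePath)>();
        var queue = new Queue<(string Id, string RelativePath)>();
        queue.Enqueue((rootFolderId, ""));

        while (queue.Count > 0) // breadth-first: folders are processed level by level
        {
            var (folderId, folderPath) = queue.Dequeue();
            string? pageToken = null;
            do
            {
                ChildPage page = await fetchPage(folderId, pageToken);
                foreach (var item in page.Items)
                {
                    string path = folderPath.Length == 0 ? item.Name : $"{folderPath}/{item.Name}";
                    // An excluded folder is never enqueued, so its contents are skipped too.
                    if (excludedPaths.Contains(path)) continue;
                    results.Add((item, path));
                    if (item.MimeType == FolderMimeType) queue.Enqueue((item.Id, path));
                }
                pageToken = page.NextPageToken;
            } while (!string.IsNullOrEmpty(pageToken)); // keep going until the last page
        }
        return results;
    }
}
```

Keeping the API call behind a delegate also makes the traversal easy to unit test with canned pages, which fits the testability goal of the Core library.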
3. Handling Google Workspace Files (BackupManager)
Native Google documents (Docs, Sheets, Slides, Drawings) require special handling as they don't have a traditional file binary to download directly.
Their specific MIME types (e.g., application/vnd.google-apps.document) identify them.
The Files.Export API method must be used, specifying a target MIME type for the desired export format (e.g., application/vnd.openxmlformats-officedocument.wordprocessingml.document for .docx).
Two static ConcurrentDictionary lookups are used:
GoogleToExportMimeType: Maps the Google MIME type to the required export MIME type.
ExportMimeTypeToExtension: Maps the export MIME type to the correct file extension (e.g., .docx, .xlsx, .png).
This ensures these files are backed up in a usable, standard format. Unsupported Google types (Forms, Sites, etc.) are detected and skipped.
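A minimal sketch of the two lookups might look like the following. The dictionary names come from the description above. The WorkspaceExport class, the GetExportTarget helper, and the exact set of mappings are assumptions; the tool's own tables may cover more types or choose different export formats.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

public static class WorkspaceExport
{
    // Google-native MIME type -> MIME type requested from Files.Export.
    public static readonly ConcurrentDictionary<string, string> GoogleToExportMimeType = new(
        new Dictionary<string, string>
        {
            ["application/vnd.google-apps.document"] =
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            ["application/vnd.google-apps.spreadsheet"] =
                "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            ["application/vnd.google-apps.presentation"] =
                "application/vnd.openxmlformats-officedocument.presentationml.presentation",
            ["application/vnd.google-apps.drawing"] = "image/png",
        });

    // Export MIME type -> file extension used in the archive.
    public static readonly ConcurrentDictionary<string, string> ExportMimeTypeToExtension = new(
        new Dictionary<string, string>
        {
            ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"] = ".docx",
            ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"] = ".xlsx",
            ["application/vnd.openxmlformats-officedocument.presentationml.presentation"] = ".pptx",
            ["image/png"] = ".png",
        });

    // Returns null for types with no mapping (e.g., Forms), which the caller then skips.
    public static (string ExportMimeType, string Extension)? GetExportTarget(string googleMimeType)
    {
        if (GoogleToExportMimeType.TryGetValue(googleMimeType, out var exportMime) &&
            ExportMimeTypeToExtension.TryGetValue(exportMime, out var extension))
        {
            return (exportMime, extension);
        }
        return null;
    }
}
```

Chaining the two lookups means a new export format only needs one entry per table, and an unmapped type falls through to the skip path instead of failing the whole backup.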
4. Incremental Logic & Flat Archive Structure (BackupManager, RepairManager)
This is central to efficiency and reliability:
Flat Structure Necessity: Directly mirroring the Drive folder structure within the ZIP can lead to "path too long" errors on Windows or within ZIP libraries. Furthermore, empirical testing showed that Google's own "download folder as ZIP" function could produce corrupted archives for complex or deep folder structures. To ensure robustness and avoid these issues, the library creates backups with a flat structure. All files reside directly in the root of the ZIP archive.
File ID as Filename: Each file is saved within the ZIP using its unique Google Drive File ID as the filename stem, plus the appropriate extension (e.g., 1aBcDeF...gHiJkL.pptx). This guarantees unique filenames within the flat structure and provides a stable identifier.
The Manifest (_manifest.json): This file becomes the "directory." It stores the mapping between the ArchivePath (FileID.extension) used in the ZIP and the file's original metadata (Drive path, name, modification time, size).
Incremental Check (BackupManager):
If a previous backup ZIP is provided, it's extracted temporarily.
Its _manifest.json is loaded into a dictionary keyed by File ID.
For each file currently listed in Google Drive:
Check if its ID exists in the old manifest dictionary.
If yes, compare the current modifiedTime from Drive with the timestamp stored in the old manifest.
If timestamps match (within tolerance), the file (FileID.extension) is copied from the old extracted archive.
If timestamps differ, or the file is new, it's marked for download/export from Drive.
Repair (RepairManager): The manifest and File ID naming are essential here too. If a file (FileID.extension) is missing from an extracted damaged archive, the RepairManager extracts the FileID from the name, queries Drive for its metadata (like MIME type), and re-downloads it, placing the recovered FileID.extension file back into the temporary directory before re-zipping.
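The incremental decision above reduces to a dictionary lookup followed by a timestamp comparison. The sketch below shows one way to express it. OldEntry, DriveFile, IncrementalPlanner, and the one-second tolerance are assumptions for illustration, since the tolerance actually used by the tool is not stated here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public sealed record OldEntry(string ArchivePath, DateTime ModifiedTimeUtc);
public sealed record DriveFile(string Id, DateTime ModifiedTimeUtc);

public static class IncrementalPlanner
{
    // Assumed value; absorbs small rounding differences in stored timestamps.
    public static readonly TimeSpan Tolerance = TimeSpan.FromSeconds(1);

    // The File ID is the filename stem of the ArchivePath ("<FileID>.<extension>").
    public static Dictionary<string, OldEntry> IndexByFileId(IEnumerable<OldEntry> manifest)
    {
        var byId = new Dictionary<string, OldEntry>(StringComparer.Ordinal);
        foreach (var entry in manifest)
        {
            byId[Path.GetFileNameWithoutExtension(entry.ArchivePath)] = entry;
        }
        return byId;
    }

    // Splits the current Drive listing into files that can be copied from the old
    // archive (by ArchivePath) and files that must be fetched again (by File ID).
    public static (List<string> CopyFromOldArchive, List<string> DownloadIds) Plan(
        IEnumerable<DriveFile> current, IReadOnlyDictionary<string, OldEntry> old)
    {
        var copy = new List<string>();
        var download = new List<string>();
        foreach (var file in current)
        {
            bool unchanged = old.TryGetValue(file.Id, out var previous) &&
                (file.ModifiedTimeUtc - previous.ModifiedTimeUtc).Duration() <= Tolerance;
            if (unchanged) copy.Add(previous!.ArchivePath);
            else download.Add(file.Id); // new file or changed timestamp
        }
        return (copy, download);
    }
}
```

Files that exist in the old manifest but no longer appear in the Drive listing simply end up in neither list, so deletions on Drive drop out of the next archive without extra bookkeeping.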
5. Parallel Processing (BackupManager, RestoreManager, RepairManager)
Downloading/uploading many files sequentially is slow. Parallelism is implemented carefully:
SemaphoreSlim: Limits the number of concurrent I/O operations (downloads, uploads, copies) based on AppSettings.MaxParallelTasks. await semaphore.WaitAsync(cancellationToken) acquires a slot, semaphore.Release() (in a finally block) releases it.
Task.Run: Queues the work for each file operation onto the .NET thread pool, so that neither the synchronous parts of an operation nor its continuations run on the main control loop. The I/O itself is awaited with async/await inside the Task.Run delegate.
System.Threading.Interlocked: Used for thread-safe updates to shared counters (e.g., Interlocked.Increment(ref filesDownloaded)).
Thread-Safe Collections: ConcurrentBag<T> or ConcurrentDictionary<TKey, TValue> are used where multiple threads might need to add to a shared collection simultaneously (e.g., collecting completed file paths during a parallel restore).
Locking: Standard lock(_lockObject) statements are used sparingly for short, critical sections where complex state needs to be updated atomically by potentially concurrent threads (e.g., finding-or-creating nested folders during restore).
// Conceptual SemaphoreSlim usage pattern within a manager
int maxParallel = _settings.GetEffectiveMaxParallelTasks();
using var semaphore = new SemaphoreSlim(maxParallel, maxParallel);
var tasks = new List<Task>();
long itemsProcessed = 0; // Shared counter; only updated via Interlocked
foreach (var item in itemsToProcessList)
{
    await semaphore.WaitAsync(cancellationToken); // Wait for a concurrency slot
    // The token is deliberately not passed to Task.Run: a task cancelled before it
    // starts would never reach the finally block and would leak its semaphore slot.
    tasks.Add(Task.Run(async () => // Queue work to the thread pool
    {
        try
        {
            // Perform async I/O operation (e.g., DownloadFileAsync)
            bool success = await DownloadFileAsync(item, cancellationToken);
            Interlocked.Increment(ref itemsProcessed);
            // Update other counters using Interlocked based on success
        }
        // ... catch blocks ...
        finally
        {
            semaphore.Release(); // ALWAYS release the slot
        }
    }));
}
await Task.WhenAll(tasks); // Wait for all background tasks to finish
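The lock-based find-or-create step mentioned in the list above could look roughly like this. FolderCache and the createFolder delegate are hypothetical. The delegate stands in for whatever synchronous call creates a folder on Drive and returns its ID, because a lock statement cannot contain an await.

```csharp
using System;
using System.Collections.Generic;

public sealed class FolderCache
{
    private readonly object _lockObject = new();
    private readonly Dictionary<string, string> _idByPath = new(StringComparer.Ordinal);
    private readonly Func<string, string, string> _createFolder; // (parentId, name) => new folder ID

    public FolderCache(string rootFolderId, Func<string, string, string> createFolder)
    {
        _idByPath[""] = rootFolderId; // the empty path maps to the restore root
        _createFolder = createFolder;
    }

    // Returns the ID of the folder at relativePath, creating missing levels once.
    public string GetOrCreate(string relativePath)
    {
        lock (_lockObject) // check-then-create must be atomic across worker threads
        {
            string parentId = _idByPath[""];
            string walked = "";
            foreach (var segment in relativePath.Split('/', StringSplitOptions.RemoveEmptyEntries))
            {
                walked = walked.Length == 0 ? segment : $"{walked}/{segment}";
                if (!_idByPath.TryGetValue(walked, out var id))
                {
                    id = _createFolder(parentId, segment);
                    _idByPath[walked] = id;
                }
                parentId = id;
            }
            return parentId;
        }
    }
}
```

Holding the lock across the create call serializes folder creation, which is the price paid for never creating the same folder twice. Folders are typically far fewer than files, so the uploads themselves still run in parallel.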
6. Resilience Features (RestoreManager, RepairManager)
Beyond basic retries in the download logic, two features improve resilience:
Resumable Restores (RestoreManager): The _restore_state.json file tracks successfully uploaded files (CompletedArchivePaths). If a restore is interrupted and restarted pointing at the same temporary directory, the manager loads this state and skips uploading files already present in the list, saving significant time and bandwidth.
Archive Repair (RepairManager): Provides a mechanism to recover from situations where the backup ZIP exists but some files within it are missing or corrupted (assuming the _manifest.json is intact). It leverages the File ID stored in the archive path names to re-fetch only the necessary files directly from Google Drive.
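Both features boil down to a set difference against the manifest's archive paths: resume subtracts the paths already recorded as uploaded, and repair subtracts the files actually present on disk. The sketch below illustrates that idea. ArchiveDiff and its method names are hypothetical, and it only detects missing files. Spotting corrupted ones would need an additional check, such as comparing sizes against the manifest.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class ArchiveDiff
{
    // Restore resume: manifest entries not yet listed in CompletedArchivePaths.
    public static List<string> PendingUploads(
        IEnumerable<string> manifestArchivePaths, IEnumerable<string> completedArchivePaths)
    {
        var done = new HashSet<string>(completedArchivePaths, StringComparer.Ordinal);
        return manifestArchivePaths.Where(path => !done.Contains(path)).ToList();
    }

    // Repair: File IDs of manifest entries whose file is absent from the extracted archive.
    // The ID is recovered from the filename stem of "<FileID>.<extension>".
    public static List<string> MissingFileIds(
        IEnumerable<string> manifestArchivePaths, string extractedDirectory)
    {
        return manifestArchivePaths
            .Where(path => !File.Exists(Path.Combine(extractedDirectory, path)))
            .Select(path => Path.GetFileNameWithoutExtension(path))
            .ToList();
    }
}
```

In both cases the work that remains is proportional to what is missing, not to the size of the whole backup.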
7. Configuration Handling (SettingsManager, AppSettings)
Provides strongly-typed access to settings loaded from app_settings.json.
Applies sensible defaults if specific settings are missing in the JSON file.
Includes validation and clamping (e.g., ensuring MaxParallelTasks is within reasonable bounds).
Supports saving/loading settings to different file paths ("profiles").
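As an illustration of defaults and clamping, the sketch below shows a pared-down AppSettings with only the MaxParallelTasks property. GetEffectiveMaxParallelTasks is the method name used in the parallelism snippet earlier. The default of 4, the upper bound of 16, the SettingsLoader helper, and the use of System.Text.Json are assumptions for this example.

```csharp
using System;
using System.Text.Json;

public sealed class AppSettings
{
    public const int DefaultMaxParallelTasks = 4;   // assumed default
    public const int MaxAllowedParallelTasks = 16;  // assumed upper bound

    // The initializer supplies the default when the JSON omits the property.
    public int MaxParallelTasks { get; set; } = DefaultMaxParallelTasks;

    // Clamps whatever was configured into the range [1, MaxAllowedParallelTasks].
    public int GetEffectiveMaxParallelTasks() =>
        Math.Clamp(MaxParallelTasks, 1, MaxAllowedParallelTasks);
}

public static class SettingsLoader
{
    // Falls back to all-default settings if the document is the JSON literal "null".
    public static AppSettings FromJson(string json) =>
        JsonSerializer.Deserialize<AppSettings>(json) ?? new AppSettings();
}
```

Clamping at the point of use, instead of rejecting the file, means a hand-edited app_settings.json with an extreme value still yields a working configuration.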
Several lessons emerged from building the library:
Manifest + Flat Structure: This combination proved key to solving long path issues and enabling robust incremental/repair operations by decoupling the archive's internal structure from Drive's structure.
State Persistence for Resume: The _restore_state.json significantly improves the user experience for the potentially lengthy restore process.
Controlled Parallelism: SemaphoreSlim offers a clean way to manage resource contention during parallel I/O. Correctly using Interlocked and thread-safe collections is vital.
API Interaction: Wrapping Google API calls with specific logic (like export handling, field selection, pagination) makes the rest of the library cleaner.
The GoogleDriveBackup.Core library provides a robust, reusable, and testable foundation for interacting with Google Drive to perform complex backup, restore, and repair operations. By addressing challenges like incremental logic, large file structures, Google Workspace file types, and performance through careful design and implementation using modern .NET features, it serves as a powerful engine. Its separation from the UI layer makes it adaptable for various front-end implementations.
In Part 2, we will explore how the GoogleDriveZipBackupTool console application utilizes this core library to provide both an interactive menu-driven experience and a flexible command-line interface for users.