Back to Portfolio

What Actually Happens When You Upload a File

Ailin Nakaganeku

Ever wondered what goes on behind the scenes when you click that upload button? From chunked transfers to parallel processing, here is how the world's largest cloud platforms handle billions of file uploads every day.

It’s 8:58 AM, the team is waiting, and you drag the final deck into the box just as the clock turns. You watch the progress bar hit 100% and join the call, never doubting that the file actually made it through.

But underneath that simple UI lies a distributed system involving chunking algorithms, parallel streams, cryptographic verification, and state recovery. None of it is magic—it is all engineering, and every piece exists for a reason.

Why Simple Uploads Fail

Traditional single-request uploads are fragile. If a 500 MB transfer fails at 95% due to a network timeout, the entire process must restart from zero. To fix this, modern systems implement three core strategies: chunked uploads, resumability, and parallel transfers.

Breaking Files into Chunks

The solution is to treat large files as a series of independent segments. Each chunk can be uploaded, verified, and retried without affecting the rest of the payload. Different platforms have different chunk size strategies:

  • Google Drive: Uses a 256 KiB minimum. This mobile-first approach ensures that losing a chunk on a 3G connection is negligible.
  • AWS S3: Enforces a 5 MB minimum for all parts except the last one, optimizing for high-bandwidth data center environments.
  • Dropbox: Uses 4 MB blocks to align with their deduplication system.
  • Azure Blob Storage: Supports blocks up to 4000 MiB for massive scale ingestion.
  • Backblaze B2: Enforces a 5 MB minimum, but recommends 100 MB parts to maximize throughput.

Note: Smart systems adapt chunk size dynamically. Start with larger chunks on stable WiFi. Detect packet loss or latency spikes? Shrink chunks automatically.
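To make chunking concrete, here is a minimal sketch of computing fixed-size chunk boundaries for a browser File. The 5 MB size is illustrative (it happens to match S3's minimum part size), and `chunkRanges` is a hypothetical helper, not any platform's API:

```javascript
// Illustrative chunk size; pick one that matches your provider's limits.
const CHUNK_SIZE = 5 * 1024 * 1024; // 5 MB, S3's minimum part size

// Compute [start, end) byte ranges covering a file of totalSize bytes.
function chunkRanges(totalSize, chunkSize = CHUNK_SIZE) {
  const ranges = [];
  for (let start = 0; start < totalSize; start += chunkSize) {
    ranges.push({ start, end: Math.min(start + chunkSize, totalSize) });
  }
  return ranges;
}

// In the browser, each range becomes an independent slice of the File:
// chunkRanges(file.size).map(r => file.slice(r.start, r.end))
```

Because `Blob.slice` is lazy, slicing does not copy the file into memory; each chunk is only read when its request actually sends.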

Resumable Uploads and Session IDs

Chunking alone does not solve resumability. The server needs to remember your upload across requests, browser refreshes, and network changes. A Session ID is a unique identifier representing your in-progress upload state. Think of it as a bookmark: when you reconnect, you present this ID, and the server knows exactly where you left off.

http
1. INITIATE SESSION (Create state)
   POST /upload/drive/v3/files?uploadType=resumable
   Response: Location: https://www.googleapis.com/upload/drive/v3/files?uploadId=xa298sd_sdlkj2

2. UPLOAD CHUNKS
   PUT {sessionUri}
   Content-Range: bytes 0-524287/2097152
   Response: 308 Resume Incomplete

3. CONNECTION DROPS (Client switches network or restarts)

4. QUERY PROGRESS (Query server state)
   PUT {sessionUri}
   Content-Range: bytes */2097152
   Response: 308, Range: bytes=0-524287

5. RESUME
   PUT {sessionUri}
   Content-Range: bytes 524288-2097151/2097152
   Response: 200 OK

The session URI is stateful and durable. Google keeps sessions alive for a week. This means you can close your browser, board a flight, and resume the upload on hotel Wi-Fi hours later. Professional upload libraries persist these IDs to localStorage to survive page refreshes.
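A minimal sketch of that persistence, assuming a key built from the file's name, size, and modification time (a simplification; these helper names are hypothetical, and real libraries also fingerprint content or scope keys by user):

```javascript
// Derive a stable key identifying this particular file on this device.
function sessionKey(file) {
  return `upload:${file.name}:${file.size}:${file.lastModified}`;
}

// Store the resumable session URI so the upload survives a page refresh.
function saveSession(file, sessionUri) {
  localStorage.setItem(sessionKey(file), sessionUri);
}

// Returns the saved session URI, or null if there is nothing to resume.
function loadSession(file) {
  return localStorage.getItem(sessionKey(file));
}
```

On page load, a resumable uploader checks `loadSession(file)` first and issues the `Content-Range: bytes */total` status query before sending any data.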

Note: The open-source tus protocol codifies these patterns into a formal specification. If you're building upload infrastructure, consider using tus instead of inventing your own proprietary handshake.

Parallel Uploads: The Speed Multiplier

Uploading chunks one at a time is like driving a fleet of trucks down a single highway lane. You're wasting capacity.

Each TCP connection is a lane, and thanks to TCP slow start, every new connection begins at a crawl before ramping to full speed. A single connection rarely saturates modern bandwidth—you need multiple lanes working simultaneously to hit peak throughput.

HTTP/1.1 allows multiple connections (lanes) per domain, giving you a raw speed multiplier. HTTP/2 multiplexes everything onto one connection; it's more efficient but shares a single congestion window. In practice, parallel chunk uploads win on both protocols by mitigating the impact of latency and packet loss.

The sweet spot is usually 4 to 6 concurrent connections for desktop browsers. Going higher yields diminishing returns and increases memory overhead.
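One way to sketch that concurrency cap is a small promise pool that keeps at most `limit` chunk uploads in flight at once. Here `uploadChunk` is a placeholder for whatever PUT logic your platform requires:

```javascript
// Upload all chunks with at most `limit` concurrent requests.
// Because JavaScript is single-threaded, `next++` safely hands each
// worker a unique index with no locking required.
async function uploadAll(chunks, uploadChunk, limit = 4) {
  const results = new Array(chunks.length);
  let next = 0;

  async function worker() {
    while (next < chunks.length) {
      const i = next++;
      results[i] = await uploadChunk(chunks[i], i);
    }
  }

  const workers = Array.from(
    { length: Math.min(limit, chunks.length) },
    worker
  );
  await Promise.all(workers);
  return results;
}
```

Results come back in chunk order regardless of which request finishes first, which matters later when the commit step needs an ordered part list.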

The Finalize/Commit Step

Chunks arrive out of order. Connections might still be in flight. The server needs an explicit signal: "I'm done—assemble this thing." Without it, your upload sits in limbo.

Skip the commit step and your chunks become orphaned data—a graveyard of ghost files haunting your storage bucket and costing you money. Every major platform requires this handshake, whether it's S3's CompleteMultipartUpload, Box's SHA-1 commit, or Google's final size header. It is the atomic moment where loose chunks become a valid file.

Note: Smart infrastructure doesn't rely on code alone. You should configure a Bucket Lifecycle Rule (e.g., in S3 or GCS) to automatically delete incomplete multipart uploads after 7 days. This is your insurance policy against billing zombies from crashed clients.
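As an illustration of the S3 flavor of this handshake, here is a hypothetical helper that turns recorded part ETags into the CompleteMultipartUpload request payload, assuming the AWS SDK v3 parameter shape (`buildCommitPayload` is our name, not an SDK API):

```javascript
// Build the payload for S3's CompleteMultipartUploadCommand from the
// ETags returned by each part upload. Part numbers are 1-based per the API.
function buildCommitPayload(bucket, key, uploadId, etags) {
  return {
    Bucket: bucket,
    Key: key,
    UploadId: uploadId,
    MultipartUpload: {
      Parts: etags.map((ETag, i) => ({ ETag, PartNumber: i + 1 })),
    },
  };
}

// Then, roughly:
// await s3.send(new CompleteMultipartUploadCommand(
//   buildCommitPayload("my-bucket", "user-123/video.mp4", uploadId, etags)
// ));
```

If any ETag is missing or wrong, S3 rejects the commit and the object never materializes, which is exactly the atomicity described above.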

Deduplication: Never Upload Twice

Dropbox built its entire sync engine around block-level deduplication: the idea that identical data should never be stored or uploaded twice. Files are split into 4 MB blocks and hashed with SHA-256. Before uploading, the client sends these hashes to the server, which checks its block index and responds with only the blocks it still needs. If every block already exists in your account, the upload completes instantly.

This is how you copy a 100 MB file within your Dropbox instantly—every block already exists, so the server just links to them. On the storage side, Dropbox deduplicates globally: identical blocks from different users are stored only once, dramatically reducing storage costs at scale.

Delta Sync: Uploading Only What Changed

While Dropbox deduplicates at the block level, Microsoft OneDrive optimizes for files you edit repeatedly. Their differential sync answers a simple question: if you change one slide in a 50 MB PowerPoint, why re-upload all 50 MB?

Differential sync works by dividing files into blocks and comparing fingerprints rather than raw bytes. The client identifies which blocks changed locally, and only those deltas get uploaded. The exact algorithm is proprietary, but the principle is the same one behind rsync: compare compact signatures, then transfer only what is different. Adding one slide to a 50 MB presentation? The sync engine detects the change and uploads just the affected blocks—a fraction of the total file size.

Originally optimized for Office files, OneDrive has extended differential sync to all file types: images, videos, PDFs, ZIP archives. For large files with frequent small edits (think: appending to logs, editing videos, updating databases), the bandwidth savings can be dramatic—transferring only the changed blocks instead of the entire file.

There is a trade-off: CPU vs. Bandwidth. Calculating block signatures requires significant processing power. On older mobile devices, the battery drain from hashing a large file might actually outweigh the speed gain from sending fewer bytes. Modern sync engines often skip delta checks for files under a certain size for this exact reason.
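A toy version of the comparison step, assuming fixed-size blocks and precomputed signatures (real engines use rsync-style rolling hashes, which also catch insertions that shift block boundaries; that is out of scope here):

```javascript
// Given block-hash signatures for the old and new versions of a file,
// return the indices of blocks that must be uploaded. Appended blocks
// have no old counterpart, so they always count as changed.
function changedBlocks(oldSigs, newSigs) {
  const changed = [];
  for (let i = 0; i < newSigs.length; i++) {
    if (oldSigs[i] !== newSigs[i]) changed.push(i);
  }
  return changed;
}
```

For the log-appending case this degenerates nicely: every existing block matches, and only the final partial block plus any new blocks go over the wire.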

Security: Presigned URLs

Here is the big security challenge: you want browsers to upload directly to cloud storage (bypassing your server for performance), but cloud storage requires credentials.

The solution is presigned URLs. Your frontend requests an upload URL from your backend. Your backend (which has credentials) generates a special URL containing a cryptographic signature using your secret key. The frontend then uploads directly to cloud storage using this signed URL, bypassing your server. There is just one catch: CORS. Since the browser is uploading to a different domain (the bucket) than your website, it will block the request by default. You must explicitly whitelist your domain in the bucket’s configuration, or the upload will never start.

The key security properties: the URL grants temporary, scoped permission to perform exactly one operation. The signature proves authorization without revealing your secret key. Anyone who has the URL can use it until it expires, which is why presigned URLs should be treated as bearer tokens.

javascript
// Backend generates a signed URL for the specific operation
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({});
const url = await getSignedUrl(s3, new PutObjectCommand({
  Bucket: "my-bucket",
  Key: "user-123/photo.jpg",
}), { expiresIn: 900 }); // Expires in 15 minutes

// The frontend then uploads directly: fetch(url, { method: "PUT", body: file })

Warning: Treat presigned URLs like bearer tokens. Anyone with the URL can write to that location until it expires. Keep expiration times short (5-15 minutes).

Integrity Verification: Trust, But Verify

With chunks flying around in parallel across unreliable networks, how do you ensure nothing got corrupted? Every major platform implements integrity verification using cryptographic hashes.

The approach varies by provider. AWS S3 supports five checksum algorithms—CRC-64/NVME, CRC-32, CRC-32C, SHA-1, and SHA-256. Box requires SHA-1 digests with every chunk and the final commit. Backblaze B2 requires the X-Bz-Content-Sha1 header on uploads, though you can opt out with "do_not_verify". However, Backblaze strongly recommends providing actual SHA-1 checksums to ensure data integrity. The client calculates a hash before sending, includes it in the request, and the server verifies the received data matches. Mismatch? Chunk rejected, must be re-sent.

Modern Transport: HTTP/3 and QUIC

TCP has a fatal flaw called Head-of-Line Blocking. Imagine a one-lane road where a car crash (packet loss) stops all traffic behind it. That's TCP. On shaky mobile networks, one lost packet stalls the entire upload.

HTTP/3 (QUIC) turns that road into a multi-lane highway. Each stream is independent at the transport layer, so a dropped packet on one stream does not stall the others. In 2026, HTTP/3 is supported on every major CDN and all modern browsers ship with it enabled. Adoption for direct-to-storage uploads is still catching up, but the protocol's benefits for mobile networks—connection migration, zero-RTT handshakes, and per-stream loss recovery—make it increasingly relevant for upload-heavy applications.

Resilience Patterns

When a server crashes, thousands of clients might retry at the exact same moment, causing a second crash. This is the Thundering Herd problem. The solution is Exponential Backoff with Jitter.

Instead of retrying immediately, clients wait. And crucially, they add a random delay (jitter) so they don't all come back in sync.

javascript
// Retry with Exponential Backoff and "full jitter": wait a random
// duration between zero and the capped exponential delay
const BASE_DELAY = 500;   // ms
const MAX_DELAY = 30_000; // ms
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

const delay = Math.min(
  BASE_DELAY * Math.pow(2, attempt),
  MAX_DELAY
);
const jitter = Math.random() * delay;
await sleep(jitter);
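Wrapped into a reusable helper, the pattern might look like this (`retryWithBackoff` is a hypothetical name, not a library API; the parameters are illustrative defaults):

```javascript
// Run fn(), retrying on failure with capped exponential backoff and
// full jitter. Rethrows the last error once attempts are exhausted.
async function retryWithBackoff(fn, { base = 500, max = 30_000, attempts = 5 } = {}) {
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt === attempts - 1) throw err;
      const delay = Math.min(base * 2 ** attempt, max);
      // Full jitter: sleep a random duration in [0, delay)
      await new Promise((r) => setTimeout(r, Math.random() * delay));
    }
  }
}
```

Each chunk upload gets wrapped individually, so one flaky chunk retries on its own schedule without holding up its siblings.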

Developer Takeaways

Here is the good news: you rarely have to build this plumbing yourself. These tools have already solved the hard problems:

  • tus.io: The open-source standard for resumable uploads. It gives you rock-solid chunking and retries right out of the box.
  • Cloud SDKs: Don't write raw HTTP requests if you don't have to. The official libraries for AWS, Azure, and Google Cloud include high-level upload classes that manage parallel streams and error handling automatically.
  • Managed Services: If you want to skip the infrastructure headache entirely, platforms like Cloudinary, Uploadcare, and Transloadit provide complete, drop-in upload pipelines.

The best upload experience is one the user never has to think about.

That's the engineering behind every file you've ever dragged into a box.