I’m using tusd behind a Cloudflare proxy. Uploads seem to complete — the file size exactly matches the original file on the client. However, after upload:
- The resulting file has a different SHA-256 hash
- The uploaded video is corrupted
Here is an excerpt from the tusd logs during the problematic upload:
The TCP reset is not necessarily a problem. It just indicates that the upload got interrupted at some point but resumed properly afterwards (that's what resumable uploads are for). The presence of such errors in tusd's log usually doesn't indicate a problem, as the upload procedure will recover from them.
Regarding the mismatching checksums, I haven't experienced that myself. Is this a frequent problem for you? Is it somewhat reproducible?
You mentioned that you use shared storage. Is it possible that the data got corrupted there? If it's a shared network disk and sticky sessions don't work properly, storage accesses from the different instances could collide with each other.
The tus protocol has methods for exchanging checksums, so the client and/or server can verify the integrity of the uploaded data, but tusd does not currently implement them. We hope to improve support for this in the future.
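For reference, the protocol's checksum extension has the client send an Upload-Checksum header with each PATCH, containing an algorithm name and a base64-encoded digest of that request's body. Below is a minimal sketch of building such a header in Go, plus a small helper for comparing whole-file SHA-256 hashes out of band (the file paths and the choice of sha256 are illustrative, and tusd will not verify the header today):

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"io"
	"os"
)

// uploadChecksumHeader builds the value of a tus "Upload-Checksum" header for
// one chunk: "<algorithm> <base64-encoded digest>". This only illustrates the
// wire format defined by the checksum extension.
func uploadChecksumHeader(chunk []byte) string {
	sum := sha256.Sum256(chunk)
	return "sha256 " + base64.StdEncoding.EncodeToString(sum[:])
}

// fileSHA256 computes the hex SHA-256 of a whole file, handy for comparing the
// client's original against the assembled file on the server out of band.
func fileSHA256(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
	// Illustrative chunk; in a real client this would be the body of one PATCH.
	fmt.Println("Upload-Checksum:", uploadChecksumHeader([]byte("example chunk")))

	// Compare this value against the hash of the assembled file on the server.
	if len(os.Args) > 1 {
		if hash, err := fileSHA256(os.Args[1]); err == nil {
			fmt.Println(hash, os.Args[1])
		}
	}
}
```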
I’m not entirely sure whether this is due to a sticky session issue or a shared backend conflict — we don’t currently log which tusd instance is handling each request, so I can’t confirm if requests are routed inconsistently.
However, I did grep through our logs and found repeated mismatched offset errors for the same upload ID:
The first "read tcp: connection reset by peer" error seems explainable (possibly a client disconnect), but the subsequent 409 "mismatched offset" errors look suspicious: they occurred several minutes apart during the same upload session.
Could this suggest a race condition or desync between multiple tusd instances accessing the same upload resource (file) — possibly caused by inconsistent routing or access to a shared disk?
Do you have any recommendation for logging or guarding against this kind of situation?
Yes, if requests are routed to different instances and the tusd instances are not synchronized via a distributed locking mechanism, such issues can appear. I recommend reading the Upload locks page of the tusd documentation, which explains this in detail.
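If you embed tusd as a Go library rather than running the binary, enabling locks comes down to registering a locker with the store composer. Here is a minimal sketch, assuming the v1 package layout (pkg/filestore, pkg/filelocker, pkg/handler); treat the exact import paths and constructors as assumptions and check them against the version you run:

```go
package main

import (
	"log"
	"net/http"

	"github.com/tus/tusd/pkg/filelocker"
	"github.com/tus/tusd/pkg/filestore"
	tusd "github.com/tus/tusd/pkg/handler"
)

func main() {
	// All instances must point at the same directory for file-based locks to
	// coordinate them. On some network filesystems file locks are unreliable;
	// in that case a locker backed by shared infrastructure is the safer choice.
	store := filestore.New("./uploads")
	locker := filelocker.New("./uploads")

	composer := tusd.NewStoreComposer()
	store.UseIn(composer)
	locker.UseIn(composer) // without a locker, concurrent PATCHes can interleave

	handler, err := tusd.NewHandler(tusd.Config{
		BasePath:      "/files/",
		StoreComposer: composer,
	})
	if err != nil {
		log.Fatalf("unable to create handler: %v", err)
	}

	http.Handle("/files/", http.StripPrefix("/files/", handler))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```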
Of course, a 409 can also be triggered if the client misbehaves and does not resume correctly after an interruption (for example, by not fetching the new offset with a HEAD request first).
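For reference, a correct resume looks like this on the wire: first a HEAD request to learn the offset the server actually has, then a single PATCH that continues exactly from that offset, with only one PATCH in flight at a time. A minimal client-side sketch using plain net/http and the core tus 1.0 headers (the upload URL and file name are placeholders):

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"os"
	"strconv"
)

// resume continues an interrupted upload: it asks the server for its current
// offset (HEAD) and only then sends the remaining bytes (PATCH) starting at
// exactly that offset. Reusing a stale offset instead is what produces
// 409 "mismatched offset" responses.
func resume(uploadURL string, data []byte) error {
	head, err := http.NewRequest(http.MethodHead, uploadURL, nil)
	if err != nil {
		return err
	}
	head.Header.Set("Tus-Resumable", "1.0.0")

	resp, err := http.DefaultClient.Do(head)
	if err != nil {
		return err
	}
	resp.Body.Close()

	offset, err := strconv.ParseInt(resp.Header.Get("Upload-Offset"), 10, 64)
	if err != nil {
		return fmt.Errorf("invalid Upload-Offset header: %w", err)
	}
	if offset < 0 || offset > int64(len(data)) {
		return fmt.Errorf("server offset %d outside local data size %d", offset, len(data))
	}

	patch, err := http.NewRequest(http.MethodPatch, uploadURL, bytes.NewReader(data[offset:]))
	if err != nil {
		return err
	}
	patch.Header.Set("Tus-Resumable", "1.0.0")
	patch.Header.Set("Upload-Offset", strconv.FormatInt(offset, 10))
	patch.Header.Set("Content-Type", "application/offset+octet-stream")

	resp, err = http.DefaultClient.Do(patch)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusNoContent {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	data, err := os.ReadFile("video.mp4") // placeholder file
	if err != nil {
		os.Exit(1)
	}
	if err := resume("https://example.com/files/abc123", data); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```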
Yes, that’s what I’m trying to confirm before moving forward. I want to make sure sticky session misrouting isn’t the root cause.
I’m planning to remove the shared backend altogether — so that each tusd instance has its own storage directory. That should technically eliminate the need for distributed locking, correct?
If the sticky sessions work properly and reliably, then yes, there is no need for a distributed lock. However, when cookie-based stickiness is used, it might not work with clients that ignore cookies (especially non-browser clients). If requests don't get routed properly, you will see 404 errors when requests reach a tusd instance that doesn't have the corresponding upload on its local disk.
I found something interesting: two ChunkWriteStart events in a row without a ChunkWriteComplete in between. The client did a HEAD, saw the new offset, and started a new PATCH before the last chunk finished writing: a classic race condition.
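A rough way to scan a merged, time-ordered log for this pattern; it only assumes that each relevant line contains the upload ID and the literal event names ChunkWriteStart / ChunkWriteComplete, so adjust it to whatever your log format actually looks like:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Reads a tusd log from stdin and reports overlapping chunk writes for one
// upload: a ChunkWriteStart that appears before the previous chunk's
// ChunkWriteComplete. Usage: go run scanlog.go <upload-id> < tusd.log
func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: scanlog <upload-id> < tusd.log")
		os.Exit(1)
	}
	uploadID := os.Args[1]

	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // allow long lines

	inFlight := false // a chunk write has started but not completed yet
	lineNo := 0
	for scanner.Scan() {
		lineNo++
		line := scanner.Text()
		if !strings.Contains(line, uploadID) {
			continue
		}
		switch {
		case strings.Contains(line, "ChunkWriteStart"):
			if inFlight {
				fmt.Printf("line %d: ChunkWriteStart while a previous chunk is still being written\n", lineNo)
			}
			inFlight = true
		case strings.Contains(line, "ChunkWriteComplete"):
			inFlight = false
		}
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
}
```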
Yes, I think in that case upload locks would definitely be advisable. Please let us know if this improves or solves your problem, so we can adjust the advice we give to people in similar situations.