Matrix: Understand Data Deletion for SysAdmins

Deletion is an interesting subject. It’s not as simple as “either it’s deleted or it isn’t”, because Matrix is distributed. To reason correctly about it, we need to understand the different kinds of deletion in principle, as well as how they are implemented in Matrix in practice.

This blog post does not (yet) provide a complete answer.

Two broad categories are:

  • user-requested deletion of “their” data
  • pruning unneeded or old data from a server

User-requested deletion ranges from a user redacting some text that they sent earlier, up to a user requesting all their data to be deleted or “forgotten” in the sense of data protection laws. What this means and how it can be applied in a distributed system is a subject for another post: Matrix: Understand Data Deletion for Users.

In this post I’ll look at the latter: data deletion for system administrators.

Pruning Unneeded or Old Data from a Server

The kind of deletion we are talking about here is deleting historical messages and/or file attachments from one server. A Matrix room that has users on multiple servers exists on all those servers. Deleting history from one of them doesn’t stop the others from keeping the history and continuing to serve it on demand.

How can I ensure preservation of the history of a room I participate in? I must either (1) use, as my home server, a server that keeps that history, or (2) ensure by some other means that such a server is in the room.

In practice, with the current Synapse server implementation, it’s not sufficient for a history-preserving server merely to be in the room. You have to ensure it actively fetches all the messages: new messages that it may have missed due to a network outage or similar, and history from before it joined the room, if applicable. A manual way to fetch missing messages is to open the room in a client and scroll all the way back to the start. This is tedious, but it does force the server to (try to) fetch any missing history from other servers. Ideally there would be a configuration option where users, groups or admins could tell their server to mirror all history without user intervention. This feature has been requested.
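The manual “scroll back to the start” step could in principle be scripted. Below is a minimal sketch, assuming the standard client-server endpoint GET /_matrix/client/v3/rooms/{roomId}/messages with dir=b, which is what a client uses when you scroll back. The function names backfill_room and http_get_json, the page size and the page cap are my own illustrative choices, not part of Synapse or any Matrix library. Whether each request really makes the server fetch missing history from other servers depends on the server, so treat this as a starting point rather than a guarantee.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Optional

# A function that performs one HTTP GET and returns the parsed JSON body.
# It is injected so the paging logic can be exercised without a network.
FetchJson = Callable[[str], dict]


def http_get_json(url: str, access_token: str) -> dict:
    """Fetch one page from the homeserver (performs a real network call)."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def backfill_room(
    base_url: str,
    room_id: str,
    fetch: FetchJson,
    page_size: int = 100,
    max_pages: int = 10_000,
) -> int:
    """Page backwards through a room's history; return the number of events seen."""
    room = urllib.parse.quote(room_id, safe="")
    path = f"/_matrix/client/v3/rooms/{room}/messages"
    from_token: Optional[str] = None
    seen = 0
    for _ in range(max_pages):  # hard cap so a misbehaving server cannot loop us forever
        params = {"dir": "b", "limit": str(page_size)}
        if from_token is not None:
            params["from"] = from_token
        page = fetch(f"{base_url}{path}?{urllib.parse.urlencode(params)}")
        chunk = page.get("chunk", [])
        seen += len(chunk)
        # The server omits "end" when there is no more history to return.
        # An empty page is also treated as the end (assumption: no filter is set).
        from_token = page.get("end")
        if from_token is None or not chunk:
            break
    return seen
```

To run it against a real server, pass something like `lambda url: http_get_json(url, my_token)` as the fetch argument, where my_token is an access token for a user who is in the room.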

Retention Policies

Synapse supports message retention policies, such as discarding user-deleted data after 3 days and discarding all messages older than 12 months. You would have to check what is currently supported: for example, at the time of writing, a Synapse issue says there is no way to apply such a policy to file attachments.
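To make the example policy concrete, here is a small sketch of the decision it implies for each message. It only illustrates the rule “3 days after user deletion, or 12 months after sending”. The names Event, RetentionPolicy and is_expired are hypothetical, “12 months” is approximated as 365 days, and none of this is Synapse’s own code or configuration format.

```python
from dataclasses import dataclass
from typing import Optional

DAY_MS = 24 * 60 * 60 * 1000  # milliseconds in a day


@dataclass
class Event:
    """Hypothetical, simplified view of a stored message."""
    event_id: str
    sent_at_ms: int
    redacted_at_ms: Optional[int] = None  # set when the user deleted (redacted) it


@dataclass
class RetentionPolicy:
    """Hypothetical policy holder; field names are illustrative only."""
    max_lifetime_ms: int      # discard any message older than this
    redaction_grace_ms: int   # discard user-deleted data this long after deletion


def is_expired(event: Event, policy: RetentionPolicy, now_ms: int) -> bool:
    """Return True if the policy says this event should be purged by now."""
    if now_ms - event.sent_at_ms > policy.max_lifetime_ms:
        return True
    if event.redacted_at_ms is not None:
        return now_ms - event.redacted_at_ms > policy.redaction_grace_ms
    return False


# The example policy from the text: 12 months (here 365 days) and 3 days.
example_policy = RetentionPolicy(
    max_lifetime_ms=365 * DAY_MS,
    redaction_grace_ms=3 * DAY_MS,
)
```

A purge job would apply a check like is_expired to every stored event on a schedule. Note that the sketch covers messages only, which matches the limitation on file attachments mentioned above.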

Minimizing Disk Usage

There is also the angle of minimising the size of the existing data. Synapse is well known to use far more space than the data really requires, on the order of ten times more. The original design choices, as I understand it, leaned towards being a “reference” design rather than being optimised for production.

There are some known ways to make Synapse’s database smaller. I have seen good reports about this procedure, though I haven’t yet used it myself: Levans’ workshop: Compressing Synapse database.

Alternatively, other server implementations exist that may have much smaller storage requirements, and switching to one of them may become feasible in the medium term (the next year or so). To make such a switch possible, folks are going to need to develop data import/export formats and tools. I have seen this mentioned but have not yet seen progress on it.

In My View

Each organization running a server is welcome to decide their retention policy.

The Matrix ecosystem needs to develop better tooling for users to control retention of their own data (within the server admin’s limits), and better user awareness of it, so that users can meaningfully and easily choose servers and policies.
