March 2019 – Julian Foad

Thoughts on where Subversion should be going, code-wise.

A companion page to What’s In My Head which is about community and project management aspects of Subversion.

General remarks:

Progress is slow: attempting to do anything with svn code is slow going. Reasons?
Technical debt: low level C code doing business logic, compatibility (almost bug-for-bug) trumps rewrites.
Functionality that may not suit all users (diff, patch, merge, externals, keywords, eol-style, WC storage, repo storage, …) has usually been built in, where pluggable interfaces would encourage modularity. Pluggable “diff” is crude, incomplete. Pluggable RA layers are our best example.
Writing third-party software that uses Svn is not as easy as it should be. See: Bindings; etc.
Companies writing commercial systems do not interact much; how can we improve that?
Rewrite clients (svn) and high level libs (libsvn_client) in a more suitable language?

Internal improvements needed:

WC APIs. The WC layer needs an API that reflects operational units. For example, “a tree in the repository” should function interchangeably as the base of a copy or of a checkout; likewise “the modifications to a tree”; but not “merge” and “update”, which are client-level concepts. Such an API is thus both higher and lower level than the current APIs. See: SVN-4786 “WC working-mods editor”, for a start.
Diff in particular shows up many inconsistencies due to bitty implementation. Once generic changes APIs are in place, rewrite ‘diff’ to use them: diff(WC, repo): {a=tree_open(WC), b=tree_open(repo); diff_trees(a,b)}.
Diff: distinguish “diff a commit” from “diff two trees”.
Bindings. State is poor, inconsistent. What’s needed is an OO generic interface, implemented in several languages. An implementation technique: write C++ OO layer on top of existing C API, then wrap it in other languages. Problem: underlying svn API design is not OO. Opportunity: design the APIs we would like to provide; reimplement libs to provide them. See Brane’s current “svncxx” work.
Repository API: Eliminate no-op changes.
Repository API: Copy content without recording ‘copy-from’.

Feature improvements:

Server-side “pull request” merging.
Renames. TODO: Write up how my strict element-id scheme (see svnmover) can improve handling renames in update and merge, even in a primitive form without server-side support. To integrate it in existing svn, depends on: WC APIs; rewrite of merge code. Develop a variation where directories are ephemeral, for better real-life semantics (like splitting and joining dirs).
Query language (Hg revsets; Paul Hammant; https://graphql.org/learn/GraphQL).
Repo syncing (revprops and locks as versioned/sequenced events; Merkle trees).

General Remarks

Subversion’s client-side code is particularly opaque.

The server side is considerably cleaner. Its user is the network protocols, it is focused on doing its job quickly and accurately, and its API surface is relatively compact and stable. The API semantics are not perfectly clean and the implementation is complicated by a lot of optimization and caching which sometimes introduces bugs, but overall it is a lot better than the client.

Rewrite clients (svn) and high level libs (libsvn_client) in a more suitable language? This code consists of simple logic, filters, configuration, hashes, composition, etc., which in C is tedious, error-prone and inconsistent, contributing to a buggy, inconsistent UX. Python would make sense for the client itself. For the libs, the difficulty is how to provide bindings; C is a good base language for that. A rewrite might not make a lot of sense for the existing client, because it has too many idiosyncrasies which would have to be either recreated in excruciating detail or smoothed out, losing compatibility. However, a rewrite dropping a lot of oddities and little-used features such as externals might make a lot of sense if we want a new client for e.g. Android and iOS.

WC APIs

The WC layer needs an API that reflects the granularity and shape of the operations we want to perform on it. Usually these operations are reading or modifying a given subtree to a given depth, with criteria such as selecting items with a certain set of statuses.

When the client commit code is scanning a WC, looking for items to commit according to the user’s given criteria, it should not itself have to iterate over a list of children, deciding which ones fit the criteria, looking up the child’s schedule status and deciding whether to recurse into a subdirectory based on its schedule and other options. Instead it should be able to invoke a suitable tree walker which does all that, and just perform commit processing on each item found. This same tree walker should be shared with other client functions such as diff and shelve that also want to operate on a similar selection.
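
A minimal sketch of such a shared walker, in Python for brevity. The `Node` type, the status names and `walk_selected` are hypothetical stand-ins for illustration, not existing Subversion APIs; the point is only that selection and recursion live in one place and callers supply the criteria.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Optional


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a WC node (not a Subversion type)."""
    path: str
    status: str = "normal"  # assumed values: "normal", "modified", "added", "deleted"
    children: List["Node"] = field(default_factory=list)


def walk_selected(root: Node,
                  select: Callable[[Node], bool],
                  depth: Optional[int] = None) -> Iterator[Node]:
    """Yield nodes under `root` accepted by `select`, parents before children.

    `depth` limits how many levels below `root` are visited (None = no limit).
    """
    stack = [(root, 0)]
    while stack:
        node, level = stack.pop()
        if select(node):
            yield node
        if depth is None or level < depth:
            # Push in reverse so children come off the stack in listed order.
            for child in reversed(node.children):
                stack.append((child, level + 1))


def is_committable(node: Node) -> bool:
    return node.status in {"modified", "added", "deleted"}


wc = Node("", children=[
    Node("a.txt", "modified"),
    Node("lib", children=[Node("lib/b.c", "added"), Node("lib/c.c")]),
    Node("old.txt", "deleted"),
])

# "commit" and a depth-limited "diff" share the same walker and criteria.
commit_targets = [n.path for n in walk_selected(wc, is_committable)]
shallow_targets = [n.path for n in walk_selected(wc, is_committable, depth=1)]
```

With this shape, commit, diff and shelve would differ only in what they do with each yielded node, not in how they find it.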

The base of a checkout and the base of a copy are both “a tree in the repository”, more or less. There should be APIs to put such a tree into the WC storage, read it out, use it as the base of a checkout, use it as the base of a copy, use it as the base of a diff, and so on, and these operations should all use a common, interchangeable “tree” object. At present there is no such “tree” object concept in the APIs, and each of these operations is implemented implicitly by code that is specific to each use case, all different from each other, so that there is no chance of obtaining consistent results and no easy way to spin up a new instance of such behaviour in the implementation of shelving.
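
To illustrate what an interchangeable “tree” object could look like, here is a small Python sketch. The `Tree` protocol, `DictTree`, `copy_tree` and `diff_trees` are all hypothetical names invented for this example; a real design would also need properties, node kinds and streaming of content.

```python
from typing import Dict, Iterable, Optional, Protocol


class Tree(Protocol):
    """Hypothetical read-only tree interface shared by all tree sources."""
    def paths(self) -> Iterable[str]: ...
    def read(self, path: str) -> Optional[bytes]: ...


class DictTree:
    """Toy tree backed by a dict; stands in for a repository or WC tree."""
    def __init__(self, files: Dict[str, bytes]):
        self._files = dict(files)

    def paths(self) -> Iterable[str]:
        return sorted(self._files)

    def read(self, path: str) -> Optional[bytes]:
        return self._files.get(path)


def copy_tree(src: Tree) -> DictTree:
    """Materialise any tree; usable as the base of a checkout or of a copy."""
    return DictTree({p: src.read(p) for p in src.paths()})


def diff_trees(a: Tree, b: Tree) -> Dict[str, str]:
    """Compare two trees without caring where either one came from."""
    changes = {}
    for path in sorted(set(a.paths()) | set(b.paths())):
        old, new = a.read(path), b.read(path)
        if old is None:
            changes[path] = "added"
        elif new is None:
            changes[path] = "deleted"
        elif old != new:
            changes[path] = "modified"
    return changes


repo_tree = DictTree({"a.txt": b"one\n", "b.txt": b"two\n"})
wc_base = copy_tree(repo_tree)  # "checkout"
working = DictTree({"a.txt": b"ONE\n", "c.txt": b"three\n"})

base_vs_repo = diff_trees(repo_tree, wc_base)
local_mods = diff_trees(wc_base, working)
```

Because every operation goes through the same interface, a shelf or any other WC-like store would only need to implement `paths` and `read` to take part in checkout, copy and diff.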

To improve this, I plan to create a set of interfaces for accessing the WC and other WC-like data such as shelves. One part of this interface is dedicated to accessing the local modifications that can be committed; this must be compatible with the interface used to commit changes to a repository and read them out. Another part is dedicated to accessing the base state of the WC; this will have something in common with the repository access interfaces too, but also will have notions such as “switched”. A third part is dedicated to accessing the uncommittable WC state such as changelists, conflicts, obstructed, missing, and unversioned items.

For the “committable changes” interface, svn_delta_editor_t is a good starting point as by definition it must support the necessary operations for a commit.

For the “uncommittable changes” interface, something new must be invented, perhaps starting with the WC “status” APIs.

For the “base state” API, the recent experimental “viewspec” output feature built on top of the WC “reporter” shows how we can describe the base state. To that must be added some way of transferring the necessary content. The present implementation directly contacts the repository URL through the RA layer when it needs to fetch base content; this needs to be abstracted out to the new interface.

Subversion’s “copy” operation comprises not just local changes but also a request for a new base subtree to be fetched. Therefore there is some dependency between the “committable changes” interface and the “base state” interface. Either the “committable changes” interface will need to depend directly on “base state” in order to fetch more base content when it needs to, or the caller will need to ensure that the necessary base content has been provided in advance and the “copy” method will need to be able to find it.
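
The two options can be sketched as follows, in Python. `BaseStore`, `ChangesEditor` and the `fetch` callback (standing in for the RA layer) are hypothetical names for this illustration only; they are not proposed signatures.

```python
from typing import Callable, Dict, Tuple

Location = Tuple[str, int]  # (repository path, revision)


class BaseStore:
    """Hypothetical "base state" interface: base content keyed by location."""
    def __init__(self, fetch: Callable[[Location], str]):
        self._fetch = fetch  # stand-in for contacting the repository
        self._cache: Dict[Location, str] = {}

    def provide(self, loc: Location, content: str) -> None:
        """Caller supplies base content in advance."""
        self._cache[loc] = content

    def get(self, loc: Location, allow_fetch: bool) -> str:
        if loc not in self._cache:
            if not allow_fetch:
                raise LookupError(f"base content for {loc} was not provided")
            self._cache[loc] = self._fetch(loc)
        return self._cache[loc]


class ChangesEditor:
    """Hypothetical "committable changes" interface; only copy is sketched."""
    def __init__(self, base: BaseStore, allow_fetch: bool):
        self.base = base
        self.allow_fetch = allow_fetch
        self.working: Dict[str, str] = {}

    def copy(self, src: Location, dst: str) -> None:
        self.working[dst] = self.base.get(src, self.allow_fetch)


fetched = []

def fake_fetch(loc: Location) -> str:
    fetched.append(loc)
    return f"content of {loc[0]}@{loc[1]}"


# Option 1: the changes interface depends on "base state" and fetches on demand.
on_demand = ChangesEditor(BaseStore(fake_fetch), allow_fetch=True)
on_demand.copy(("trunk/a", 5), "branch/a")

# Option 2: the caller provides the base content first; copy only looks it up.
store = BaseStore(fake_fetch)
store.provide(("trunk/a", 5), "content of trunk/a@5")
prefetched = ChangesEditor(store, allow_fetch=False)
prefetched.copy(("trunk/a", 5), "branch/a")
```

In this toy model both options leave the same working content; they differ in who is responsible for contacting the repository, and when.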

WC Tree-of-Changes Walker

(May be needed by: “Revert”, “unshelve”, …)

We need a tree-of-changes walker that visits nodes in depth-first order and that (1) visits a directory both before and after its children; and (2) visits a replaced subtree twice, once for the delete and then again for the add. This is useful especially, but not only, for the WC.
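
A sketch of that visiting order in Python. The `Change` type, its `replaces` field and the event tuples are assumptions made for this example, not an existing or proposed Subversion structure.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Change:
    """Hypothetical node in a tree of changes."""
    path: str
    action: str  # assumed values: "modify", "add", "delete"
    is_dir: bool = False
    children: List["Change"] = field(default_factory=list)
    # For a replacement: the deleted subtree that this added node replaces.
    replaces: Optional["Change"] = None


Event = Tuple[str, str, str]  # (phase, action, path)


def walk_changes(node: Change) -> Iterator[Event]:
    """Depth-first walk with pre/post directory visits and two-pass replaces."""
    if node.replaces is not None:
        # First pass over a replaced subtree: the delete side.
        yield from walk_changes(node.replaces)
    if node.is_dir:
        yield ("enter", node.action, node.path)
        for child in node.children:
            yield from walk_changes(child)
        yield ("leave", node.action, node.path)
    else:
        yield ("visit", node.action, node.path)


old_src = Change("src", "delete", is_dir=True,
                 children=[Change("src/old.c", "delete")])
new_src = Change("src", "add", is_dir=True,
                 children=[Change("src/new.c", "add")],
                 replaces=old_src)
root = Change("", "modify", is_dir=True,
              children=[Change("README", "modify"), new_src])

events = list(walk_changes(root))
```

A consumer such as revert or unshelve could then act on the “enter” event, the “leave” event, or both, depending on whether it needs a directory to exist before or after its children are processed.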

Diff a Commit vs. Diff Two Trees

Repository API: Eliminate No-op Changes

A curse of inconsistency. Some operations in the commit API, such as “open a file”, leave a trace in the FS even if no change in content is made, which then manifests in a similar API method call sequence when the commit is replayed. We often call these “no-op changes”, though “touches” may be more descriptive.

Arguing from a written description of the issue, without concrete details, it has proven hard to convince people that this is a bug: they can argue that storing and retrieving an extra flag that “the object was touched” is a feature. Using two approaches I am confident we can show it is a bug. (1) Show that the behaviour is inconsistent. (2) Argue that the Subversion data model is supposed to be based on stored state, and this amounts to “hidden state”.

(1) Inconsistency. There are different potential cases where a no-op change might be considered: open a file and close it, open and apply a null text-delta, copy and apply a null text-delta, change an existing property to the same value, delete a nonexistent property, etc. TODO: test many cases and report the inconsistencies between them.
There are different methods for information retrieval: list changes, replay revision, delta-dirs, update editor, etc. There are different API levels to try: FS, repos, RA. TODO: test retrieval methods and report the inconsistencies between them.

(2) Data model. In the general model of versioning a series of snapshots of user’s data, we should be able to extract any two stored states and calculate the difference between them, but a “touched” flag does not work like that; it is a function of some hidden state.
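
The data-model argument can be made concrete with a toy example in Python. The dict-based snapshots and the `recorded_changes_r2` list are hypothetical; they model the idea of a stored “touch”, not the actual FS implementation.

```python
from typing import Dict, List


def changes_between(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """Paths whose stored state differs between two snapshots."""
    return sorted(p for p in set(old) | set(new) if old.get(p) != new.get(p))


r1 = {"a.txt": "hello", "b.txt": "world"}
# r2: a commit that opened a.txt and closed it again without changing it.
r2 = dict(r1)

# Hypothetical: what a replay of r2 reports if the "touch" was recorded.
recorded_changes_r2 = ["a.txt"]

derived = changes_between(r1, r2)
# Anything reported but not derivable from the two states is hidden state.
hidden = sorted(set(recorded_changes_r2) - set(derived))
```

Here `derived` is empty while `hidden` contains `a.txt`: the reported change cannot be reconstructed from the two snapshots alone, which is what is meant by “hidden state”.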

Repository API: Copy Content Without ‘Copy-From’

Should be able to:

commit a reintegrate merge from WC to repo efficiently, in O(1) time and size.
perform a reintegrate merge server-side, in O(1) time and size (new feature)

When committing changes, it is sometimes the case that the target content (of a file, subtree, or branch) is content that already exists somewhere else in the repository. This is common in merging, where the merged content of a file (or subtree) on the target branch is identical to content that exists on the source branch. In a reintegrate (or “copy up”) merge, the entire content of the target branch becomes equal to the current content of the source branch.

It is inefficient for the client to transmit changes against its working copy base to recreate the desired content on the server, when in principle it could instead tell the server where else that content may be found.

The existing “copy” method in the editor API copies the referenced content. At the same time it rewrites the history pointer to point to the “copied from” location. What we need is a variant of “copy” which keeps the default “natural predecessor” history.
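
The difference between the two kinds of copy can be shown with a toy model in Python. `NodeRev`, `copy_with_history` and `copy_content_only` are hypothetical names, and the predecessor bookkeeping is deliberately simplified; this is not the editor or FS API.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class NodeRev:
    """Toy node-revision: content plus a pointer to its history parent."""
    content: str
    predecessor: Optional[Tuple[str, int]]  # (path, revision)


Head = Dict[str, NodeRev]


def copy_with_history(head: Head, head_rev: int, src: str, dst: str) -> NodeRev:
    """Models the existing copy: dst takes src's content AND src's history."""
    return NodeRev(head[src].content, predecessor=(src, head_rev))


def copy_content_only(head: Head, head_rev: int, src: str, dst: str) -> NodeRev:
    """Models the wanted variant: dst takes src's content but keeps its own
    natural predecessor (its previous state at the same path), if it has one."""
    natural = (dst, head_rev) if dst in head else None
    return NodeRev(head[src].content, predecessor=natural)


head_rev = 10
head: Head = {
    "trunk/f": NodeRev("old", ("trunk/f", 9)),
    "branch/f": NodeRev("new", ("branch/f", 9)),
}

# Either way the client names only two paths, so the request stays small
# regardless of how much content is involved.
as_copy = copy_with_history(head, head_rev, "branch/f", "trunk/f")
as_content = copy_content_only(head, head_rev, "branch/f", "trunk/f")
```

Both results have the same content; only the second leaves the history of `trunk/f` pointing at its own previous revision, which is what a reintegrate merge needs.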

(There is a “link” method in the FS API, which is not exposed in the RA API. It appears to be somehow related to this. TODO: investigate whether that is of any use.)

Server-side “pull request” merging

A companion page to What’s In My Head.

Some specific code enhancements

WC subtree read/write APIs. The WC layer needs an API that reflects operational units, such as “a tree in the repository” functioning consistently as the base of a copy or of a checkout, and “the modifications to a tree”; but not “merge” and “update”, which are client-level concepts. Such an API is thus both higher and lower level than the current APIs. See: SVN-4786, for a start.
Diff in particular shows up many inconsistencies due to bitty implementation. Once generic changes APIs are in place, rewrite ‘diff’ to use them: diff(WC, repo) := {a=open(WC), b=open(repo); diff(a,b)}.

System-level topics

Renames
- My strict element-id scheme (see svnmover) can improve handling renames in update and merge. Write it up, esp. how it can be used primitively without server-side support. To integrate it in existing svn, depends on: WC subtree read/write APIs. Develop a variant where directories are ephemeral, for better real-life semantics (like splitting and joining dirs).
Query language
- Hg revsets; Paul Hammant; https://graphql.org/learn/GraphQL.
Repo syncing
- The design of repo syncing (svnsync) is incomplete: especially revprops and locks
- revprops and locks as versioned/sequenced events; Merkle trees?
Rewrite high level libs (libsvn_client) and clients (svn) in Python
- They consist of simple logic, filters, config, hashes, composition, etc., which in C is tedious, error-prone and inconsistent, contributing to a buggy, inconsistent UX. For libs, the difficulty is how to provide bindings; C is a good base language for that.
OO Bindings
- Bindings state is poor, inconsistent. What’s needed is an OO generic interface, implemented in several languages. An implementation technique: write C++ OO layer on top of existing C API, then wrap it in other languages. Problem: underlying svn API design is not OO. Opportunity: design the APIs we would like to provide; reimplement libs to provide them.
Plug-ins
- Functionality is built-in first, where pluggable interfaces would be better (diff, patch, merge, externals, keywords, eol-style, WC storage, repo storage, …)
- example: ‘diff’ plug-in support exists but is weak: want different diff tools for different file types, want to configure an external tool for tree diffs
A new ‘svn’ client
- Consider writing an alternative ‘svn’ client from scratch with consistent functionality built from layers of blocks, not attempting to emulate CVS, and taking inspiration from git/hg/bzr. For example, it should provide a function to output a tree (or file) which should encompass all of the existing ‘svn list’, ‘svn cat’, ‘svn proplist’, ‘svn export’, ‘svn diff -r0:REV’, ‘svnrdump dump’; and a function to input a tree (or file) which should provide the inverse of all of those output modes, encompassing ‘svn propset’, ‘svnmucc put’, the hypothetical ‘svn addremove’, etc.

General observations

Progress seems to be stifled. Why?

It’s a “maturing” project with decreasing volunteer activity.
Technical debt: low level C code doing business logic, compatibility (almost bug-for-bug) trumps rewrites.
Writing third-party software that uses Svn is not as easy as it should be. See: Bindings; etc.
Companies writing commercial systems do not interact much; how can we improve that?

Svn Code Developments in my Head