asdf-dependency-grovel

Quickdocs

asdf-dependency-grovel

2017-04-03

Analyse the dependencies in an ASDF system

Upstream URL

gitlab.common-lisp.net/xcvb/asdf-dependency-grovel

Author

Andreas Fuchs, Matthew Steele and Francois-Rene Rideau

License

Not determined

README

Provided Systems

Basic Overview ============== First, I'll give a brief overview of how asdf-dependency-grovel (ADG) works, so that you'll be better able to understand the nitty-gritty details that come later. Constituents ------------ A "constituent" represents a piece of the codebase. Constituents are arranged into a tree, with the children of a constituent representing disjoint pieces of their parent constituent. The constituent objects are used to store information about dependencies between different parts of the codebase. ADG's constituent system is designed to be quite flexible at representing different shapes and sizes of codebases, but in most cases (including QPX), the constituent tree will have exactly three levels. At the top, there is a single root constituent that represents the whole codebase. The root constituent has as children one constituent for each file in the codebase -- these are called "file-constituents". Each file-constituent has as children one constituent for each top-level form in the file -- these are called "form-constituents". The goal of ADG is to determine the dependencies between the various form-constituents, which allows it not only to infer dependencies between files, but also to determine how the top-level forms could be reorganized between files so has to eliminate file-level dependency cycles. Provisions and Uses ------------------- ADG detects dependencies through the use of the `signal-provider' and `signal-user' functions. When a form "provides" something -- for example, a defun form provides the function being defined -- then ADG will call the signal-provider function with the name of the thing being provided (as a symbol) and a symbol indicating what kind of thing was provided. This will cause that "provision" -- that is, the name/kind pair -- to be added to the current constituent's set of provisions. Similarly, when a form "uses" something (at compile time), ADG will call the signal-user function with the name and kind of the thing used, which adds the use to the current constituent's list of uses. In general, if constituent A provides X and constituent B uses X, then we conclude that B depends on A. This is ADG's fundamental mechanism for determining dependencies. This approach is relatively easy to implement, and usually works, but it is also flawed. Suppose that we have three forms: A, B, and C. Forms A and B are both (defconstant +sqrt2+ 1.41), and form C is (defvar *foo* +sqrt2+). ADG will call (signal-provider '+sqrt2+ 'defconstant) for each of forms A and B, and will call (signal-user '+sqrt2+ 'defconstant) for form C (as well as calling (signal-provider '*foo* 'defvar)). ADG will then conclude that form C depends on _both_ A and B, when in reality, Common Lisp allows it to depend on _either_ of A or B, since defconstants may be repeated as long as they all have the same value for the constant. ADG simply doesn't try to handle this case correctly, largely because doing so would complicate some of the later stages with relatively little benefit. Finally, it should be noted that ADG detects provisions and uses heuristically, and as a result, will frequently have both false negatives and false positives when detecting dependencies (we'll see later on why this is, sadly, unavoidable). The main trick to making ADG work is to balance these so that the wrong answer it comes up with is still good enough to successfully eliminate the dependency cycles in the particular codebase of interest. Cycle Resolution ---------------- Once ADG has collected all the provisions and uses for each constituent, it builds a directed graph with form-constituents as nodes and dependencies as edges. We will assume that the resulting graph will be acyclic, which should be true if the code was able to compile and if the dependency detection didn't have too many false positives. The goal in this stage is to eliminate dependency cycles between files. We could do this by splitting _every_ form into its own file, but that would create a huge mess. Or, we could merge everything into one file, but that is a similarly degenerate solution. Or, we could drastically rearrange the code, but that too would be confusing to the poor humans to organized these forms into each of these files for a reason. The compromise the ADG takes is to only ever split existing files into pieces (not to merge anything), and try to make as few splits as possible. To do this, ADG starts with the graph of form-constituents, and treats each form-constituent as though it were a separate file; it then merges together pairs of nodes into supernodes as much as possible, subject to the constraints that 1) it may only merge nodes that originally came from the same file, and 2) it must never perform merges that would introduce a cycle into the graph. As it turns out, perfectly minimizing the number of nodes in the final graph is NP-hard, so ADG uses a greedy algorithm that seems to make for a reasonable approximation. Once ADG has finished merging the graph, it sorts the nodes in that graph topologically to create a new file order. In order to protect against false negatives in the dependency detection, ADG uses a _stable_ topological sort, keeping the files as close to the original file order as possible (thus, if file B depends on file A and ADG didn't notice, there's still a good chance that file B won't load until after file A in the new file order). File Splitting -------------- Once ADG has determined how to split the files and what order to put the new files in, we can run its output through a program that will automatically divide up the actually files. Here we have to deal with one hitch -- ADG doesn't really understand the significance of in-package forms, and will treat them just as any other form, to be put one one piece or another of a split file. To deal with this, the program that actually splits up the files will check for pieces of files that don't have the in-package declaration; for each such piece, it will check the _original_ file to see if it has only a single in-package form, and if so, it will copy that form to the top of the split piece. If the file has multiple in-package forms (an unusal case), then the program simply prints a warning, indicating that a human should do it manually. Fortunately, there are usually no such warnings. How it _Really_ Works ===================== Now that you have a basic idea of what ADG does, let's take a look at the nasty details of how it detects provisions and uses. Beware -- this will be an adventuresome journey into just how annoying Common Lisp can be as a language. Fine-Grain Instrumented Load ---------------------------- The first piece of trickery is how ADG can keep track of which top-level form in the file it is working with, so that it can instantiate the constituents correctly. ADG provides a drop-in replacement for the load function called fine-grain-instrumented-load that does this. Intuitively, this simply needs to pass the file through the Lisp reader to get a list of forms, and then eval each form in a new form-constituent. However, the Common Lisp spec requires a variety of particular, subtle behaviors from the load function, which makes it much more difficult to provide a portable, drop-in replacement. Thus, ADG simply uses a modified copy of the load function from the SBCL source code. That makes it unportable, but oh, well. The replacement load function instantiates a new form-constituent for each top-level form, and also preprocesses each form before eval-ing it (this step is described in later sections). Finally, the replacement load function keeps track of some anciliary data, such as the index of the form in the file (e.g. 1st form, 2nd form, etc.), which serves to uniquely identify the form, and the position of the form in the file as an index into the character stream, which aids in the automatic file splitting later on. The Macroexpand-Hook and Handlers --------------------------------- The primary mechanism that ADG uses to find dependencies is a custom *macroexpand-hook*. This is a function that Common Lisp will call whenever it is about to expand a macro; this gives ADG an opportunity to examine the macro form, or even to alter it. In general, when the code expands a macro foo, ADG calls (signal-user 'foo 'defmacro) to indicate that a macro is being used. However, ADG also has a number of "handlers" which it uses to give special treatment to a number of standard Common Lisp macros. For example, defmacro is itself a macro, and when ADG's macroexpand-hook sees a defmacro of a macro foo, it calls (signal-provider 'foo 'defmacro). When ADG sees a defun macro for a function foo, it calls (signal-provider 'foo 'defun) and also alters the function definition to include a call to (signal-user 'foo 'defun), so that if the function is called at compile time, it will signal a use of the function. When ADG sees a defstruct macro, it does all kinds of squirrelly things, because defstruct is a horrible mess; among the things it does is to insert code after the defstruct to "retroactively" instrument the accessor functions that are automatically generated by the defstruct macro. There are a wide variety of other handlers defined in ADG, and more can be easily added if necessary. It should be noted that it is largely a fortunate coincidence that many of the things that we are interested in (e.g. defmacro, defun, defvar, etc.) happen to be defined as macros in Common Lisp, and can thus be caught by the macroexpand-hook. There are other things we might be interested in, such as eval-when, that are defined as special forms; we have no easy way to detect these without walking the code "manually". (Incidentally, _Common Lisp: The Language_, second edition, describes a facility called *evalhook* that allows a user-defined function to be called on _every_ subform as it is evaluated. Such a facility would be extremely useful for ADG. Unfortunately, that facility is no longer part of the Common Lisp spec, and is not supported by SBCL.) Detecting Variable Uses ----------------------- It is generally easy for ADG to detect when global variables (or constants) are provided, as the macroexpand-hook will notice defvar and defparameter (and defconstant) forms (ADG treats defvar and defparameter as essentially equivalent for the purposes of dependency detection -- when it sees a defparameter form, it signals a provision of kind defvar). What is harder is to detect when a global variable or constant is being used at compile time. The trouble is that there are a lot of different ways to "use" a global variable at compile time. You could read the variable and use its value at compile time. You could alter the value of the variable at compile time with setf or setq. You could dynamically rebind the variable at compile time with let. You could even quote the variable name and pass it to a function like boundp or constantp, at compile time. And yes, QPX does all of the above. There are several clever strategies that don't work. One is to instrument defvars so that the variable name is actually symbol-macro that expands to some code that signals a use of the variable and then returns the value. You can even write a setf function so that setf will work with such macros. Unfortunately, this doesn't work with let forms or quoted symbols, since those are not subject to symbol-macroexpansion. Another neat idea would be to use some kind of alternative environment object that could somehow detect when global variables (or other things!) are being used. However, there is explicitly no portable way to do this, and I can't even find a non-portable way to do it in SBCL. In the end, what ADG actually does is incredibly stupid. When it sees a defvar, not only does it signal a provision, but it adds the symbol to a set called *suspected-variables*. Before evaluating each form, ADG performs a preprocessing step that, among other things (see later sections), walks all the symbols in the form, looking for ones that are in *suspected-variables*; if it finds any, then it assumes that that is a use of the variable, and makes an appropriate call to signal-user. Of course, this results in a number of false positives, since many of those appearances will not be relevant at compile time. However, amazingly, it seems to be good enough. The Readtable and Constituent Transfers --------------------------------------- This is where things start to get a little more hairy. QPX contains a large number of sharpdot (#.) forms. These indicate a form that is to be evaluated at read-time, and the result to be inserted into the code in place of the sharpdot form. Of course, these sharpdot forms may well make use of functions defined in other files, in which case there is a dependency between the top-level form containing the sharpdot and the top-level form that provided the function (or whatever) that the sharpdot form used -- in particular, we must ensure that when we split up and reorder the files, the file in which the providing form ends up must come before the file in which the sharpdot ends up. The trouble is that the evaluation happens while the file is being read, which is before ADG is able to break the file up into top-level forms and set the *current-constituent* variable to each form-constituent in turn. Thus, if the evaluation of the sharpdot invokes any signal-user calls, the use will not be added to the correct constituent. ADG solves this by using a modified *readtable* with a replacement sharpdot reader macro function (as well as other replacement reader macros -- more on that later). The new sharpdot sets the *current-constituent* variable to a newly instantiated "temp-constituent" object -- a constituent that has no parent -- before evaluating the form. Thus, this temp-constituent will catch any use (or provision) signals caused by the evaluation of the sharpdot form. Then, the replacement sharpdot function must return a piece of data containing both the result of the evaluation (which should be inserted into the code) as well as the temp-constituent object. When we get around to evaluating the top-level form in which the sharpdot appeared, we will "transfer" the signals from the temp-constituent to the new *current-constituent*, so that the form in which the sharpdot appears will have the proper dependencies. Now, what should the sharpdot function return to allow us to make this transfer? An obvious answer is to have it return a macro call that, when it expands, performs the transfer and the returns the result value. Thus, the sharpdot reader macro function would return something like `(with-transfer-constituent ,temp-constituent ,result). Since macroexpansion won't happen until we evaluate the top-level form in question, and thus when we have the proper *current-constituent* in place, this seems workable. The trouble is that sharpdots may appear in places that are not subject to macroexpansion. Instead, we will have the sharpdot reader macro function return the with-transfer-constituent list as before, but we will perform our own macroexpansion-like operation during the form preprocessing step described in the previous section that deals with appearances of with-transfer-constituent anywhere in the form. Walking a form and replacing with-transfer-constituent lists is actually much trickier than it sounds, because there are a number of edge cases. For example, one might think that when walking a subform that is a list, one could simply check each item of the list to see if the item is a list beginning with the symbol 'with-transfer-constituent. Not so. Someone might write something like (progn . #.(loop ...)), in which case the 'with-transfer-constituent symbol will appear as the second item in a list. Yes, that code is legal Lisp. Yes, QPX does that. A lot. All told, this strategy for dealing with sharpdots works pretty well, but it opens up a can of worms -- there are a handful of very nasty, very subtle issues that I won't go into here (but that are documented in the code). Suffice to say that these issues are largely the result of certain parts of Common Lisp, particularly backquotes and compiler macros, being underspecified in perilous ways. Manual Annotations ------------------ Despite all the heuristics ADG uses to try to find dependencies, sometimes you have to give it some hints. This is a slightly dicey game, and requires some familiary with ADG to know how to make it work, but when carefully done it can help tremendously. One all too common case is to have a global variable that is a hash table, and have one file that populates the hash table at compile time, and another file that requires the hash table to be populated in order to compile. There's not really any way for ADG to understand this dependency, but it's usually easy to annotate manually: the function that adds an entry to the table can make a call to signal-provider with some appropriate tag, and the function that expects the entry to be there can call signal-user. ADG adds the symbol :groveling to *featuresa* when it's running, so you can use code like #+groveling(asdf-dependency-grovel:signal-user foo-key 'foo-table-entry) to ensure that the codebase won't be affected when you're not using ADG. Another possible problem is when there are two adjacent forms that should be kept together, but ADG fails to understand this and happens to try to split them into separate files. One example from QPX is code that looks like: (defvar *foo*) (with-build-version :foobar-3 (setf *foo* (make-hash-table))) These are two separate top-level forms, but splitting them up can cause breakage, and ADG is easily confused by defvar dependencies. A simple solution is to simply wrap these two forms into a progn, thus making them a single top-level form and preventing ADG from splitting them: (progn (defvar *foo*) (with-build-version :foobar-3 (setf *foo* (make-hash-table)))) How to Use ADG ============== *** more here *** -------------------------------------------------------------------------------- First, let's worry about other reader macros. Sharpdots can appear just about anywhere, including inside other reader macros. For example, you could write something like #+#.(stuff). Yes, QPX does that too. Fortunately, that's no trouble -- we just need to replace other reader macros in the readtable so that the perform the proper preprocessing step before doing their thing. While we're at it, we can instrument sharpquote to signal a use of the function in question; of course, like sharpdot, it'll have to return a with-transfer-constituent form so that the signal can get properly transferred in the preprocessing step. That's easy enough, until we get to the next issue. How about backquotes? What happens when you write `(foo #'(setf ,bar)) after we replace sharpquote as above? Actually, the behavior is not portably defined, because Common Lisp doesn't specify how backquote works, only what it does when evaluated (in particular, the value of (car (quote `(a ,b))) is undefined in Common Lisp). In SBCL, I've found that that snippet causes quotes to appear in unexpected places, causing the with-transfer-constituent replacement to get confused. Mercifully, this sort of thing actually doesn't seem to crop up in QPX (although it does in QRES), and ADG currently makes no attempt to deal with it. Now, how about compiler macros? Compiler macros have the useful (but not well specified) property that if a compiler macro function returns the same form unchanged, then compiler-macroexpansion of that form halts, even though the resulting form is of course still a valid compiler macro. Thus, when performing with-transfer-constituent replacement, we must be