Checking the typechecker mypy

Analyzing a system that usually does the type-checking analysis for us feels like suddenly being on the other end of the rope. This time, we’ll cover the code quality of the Python library mypy and how that quality is safeguarded. Read on if you’d like to know how we think mypy can be improved on an architectural level!

Maintaining quality mypy code

To ensure software quality, mypy strives to do three things. The first is obvious and something every open-source project strives to do: identify and try to solve open issues. Second, refactoring and simplifying parts of the code, such as the conditional type binder and the semantic analyzer, decreases the probability of future bugs, which is essential for code quality. Finally, an extensive test suite ensures that existing features keep working when new features or fixes for open issues are added. In the following sections, we will guide you through the process of upholding the quality standards of mypy.

Delivering fluent updates (CI)

Mypy delivers its code fast by triggering Continuous Integration on AppVeyor for Windows-based systems and on TravisCI for Unix-based ones. While AppVeyor runs two simple tests, TravisCI covers a whole set of Python versions and configurations at the same time. Furthermore, the mypyc-compiled version of mypy is run there to test new changes. All in all, mypy is tested on OSX, Ubuntu and Windows.

Finally, TravisCI is used to build the documentation, trigger a wheels build, check code style and perform type checks of mypy commands. The last of these is especially amazing when you think about it!

The CI pipeline of mypy triggers a suite of tests, which can be run locally as well. The test setup has the nice feature that you can specify which test to run. For one of our contributions to mypy (PR #8524), we preferred to run only the single test for a specific feature instead of the full test suite. This is possible with the following command:

python -m pytest .\mypy\test\testpep561.py::TestPEP561::test_mypy_nositepackages_setting_accepted
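The selector in this command is a pytest node ID: the test file, the test class and the test function joined by `::`. As a minimal sketch (the helper name `pytest_node_id` is our own and not part of pytest or mypy), such an ID can be assembled in a way that also turns the Windows-style backslashes into forward slashes:

```python
from pathlib import PureWindowsPath


def pytest_node_id(test_file: str, *parts: str) -> str:
    """Join a test file path and nested names into a pytest node ID.

    Hypothetical helper for illustration only.
    """
    # PureWindowsPath accepts both "\\" and "/" separators on any platform,
    # and as_posix() renders the path with forward slashes.
    posix_path = PureWindowsPath(test_file).as_posix()
    return "::".join([posix_path, *parts])


node_id = pytest_node_id(
    r".\mypy\test\testpep561.py",
    "TestPEP561",
    "test_mypy_nositepackages_setting_accepted",
)
print(node_id)
```

The resulting string can be passed to `python -m pytest` in place of the hand-written selector.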

The test named in the command above is an integration test, which sets up a virtual environment for each test. In the next section, let’s take a look at which testing methods are applied to screen mypy.

Testing: a layered approach

Mypy implements three different kinds of tests in order to maintain quality. They can be seen as layers that you might be familiar with: unit, integration and regression tests. On top of these layers there is one final layer involving a linter, to wrap it all up.

Unit tests

First there are the unit tests, which test each individual component or class. New contributors often come into contact with them quickly, as we noticed ourselves while trying to get our first pull request in. The way to write such unit tests is clearly laid out in the developer guide. Developers are expected to write their own unit tests, which are more like test case description files. These descriptions live in text files with the .test extension and are parsed by a class in mypy/test/data.py to create the tests. Most of the tests simply perform a type check on the code. However, there is a different type of unit test defined in mypy/test/testpythoneval.py, which not only type checks a given program but also runs it.
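To give a feeling for what such a description file looks like and how it could be turned into test cases, here is a heavily simplified sketch. It assumes only the `[case ...]` headers and `# E:` expected-error comments of the format; the class `SketchCase`, the function `parse_cases` and the sample error message are our own inventions and are not the parser in mypy/test/data.py.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

CASE_HEADER = re.compile(r"^\[case (\w+)\]$")
EXPECTED_ERROR = re.compile(r"#\s*E:\s*(.+)$")


@dataclass
class SketchCase:
    """One test case: a program plus the errors expected per line."""
    name: str
    source: List[str] = field(default_factory=list)
    expected_errors: Dict[int, str] = field(default_factory=dict)


def parse_cases(text: str) -> List[SketchCase]:
    cases: List[SketchCase] = []
    for line in text.splitlines():
        header = CASE_HEADER.match(line)
        if header:
            # A new [case ...] header starts a new test case.
            cases.append(SketchCase(name=header.group(1)))
            continue
        if not cases:
            continue  # ignore anything before the first case header
        case = cases[-1]
        case.source.append(line)
        error = EXPECTED_ERROR.search(line)
        if error:
            # Key the expected message by its 1-based line in the program.
            case.expected_errors[len(case.source)] = error.group(1).strip()
    return cases


# Illustrative input; the error text is made up, not real mypy output.
SAMPLE = """\
[case testIntPlusStr]
x: int = 1
x + 'a'  # E: Unsupported operand types

[case testPlainAssignment]
y: str = 'ok'
"""

cases = parse_cases(SAMPLE)
for case in cases:
    print(case.name, case.expected_errors)
```

A test runner would then type check each program and compare the reported errors with `expected_errors`.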

Integration tests

You can see unit tests as ensuring that small individual components work the way they are intended to (very local in the code). Integration tests, on the other hand, have a more global scope. They ensure that all the parts put together, the system as a whole, work the way they are supposed to. Integration tests are defined for mypy in test files located in mypy/test/test*.py.

Regression tests

Regression tests ensure that when new functionality is added, the unit and integration tests are all rerun. This makes sure that everything still works as intended after the addition. You can run the regression tests directly from the mypy repository by running pytest. Pytest is configured automatically through pytest.ini, which specifies the paths and Python test files that should be run.
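The effect of such a configuration can be sketched in a few lines. The ini content below is an illustrative stand-in and not copied from mypy’s repository, `is_collected` is a hypothetical helper, and real pytest collection is considerably more involved than this prefix-and-pattern check.

```python
import configparser
from fnmatch import fnmatchcase

# Assumed sample configuration, for illustration only.
SAMPLE_INI = """\
[pytest]
testpaths = mypy/test
python_files = test*.py
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_INI)
testpaths = parser["pytest"]["testpaths"].split()
pattern = parser["pytest"]["python_files"]


def is_collected(path: str) -> bool:
    """Rough approximation: file lies under a test path and matches the pattern."""
    directory, _, filename = path.rpartition("/")
    under_testpath = any(
        directory == root or directory.startswith(root + "/") for root in testpaths
    )
    return under_testpath and fnmatchcase(filename, pattern)


for candidate in ["mypy/test/testpep561.py", "mypy/test/data.py", "mypy/build.py"]:
    print(candidate, is_collected(candidate))
```

With this sample, only files below the listed directory whose names start with `test` are picked up, which is why a helper module such as mypy/test/data.py is not itself run as a test.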

The final layer

You might think that this would be enough, but wait, there is more. Another layer is placed on top of the tests, defined in runtests.py. This is the testing entry point that the documentation in the main repository points to. It runs the pytest suite described before and adds linting with flake8 to check code quality standards.
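The idea of one entry point that runs several checks and fails if any of them fails can be sketched as follows. This is not mypy’s actual script: `run_stages` is a hypothetical helper, and the two commands are harmless stand-ins so that the sketch runs anywhere. A real runner might use `[sys.executable, "-m", "pytest"]` and `[sys.executable, "-m", "flake8"]` instead.

```python
import subprocess
import sys
from typing import Dict, List


def run_stages(stages: Dict[str, List[str]]) -> Dict[str, int]:
    """Run every stage in order and record its exit code."""
    results: Dict[str, int] = {}
    for name, command in stages.items():
        completed = subprocess.run(command)
        results[name] = completed.returncode
    return results


# Stand-in commands: the first succeeds, the second simulates a lint failure.
STAGES = {
    "tests": [sys.executable, "-c", "print('pretend test run')"],
    "lint": [sys.executable, "-c", "import sys; sys.exit(1)"],
}

results = run_stages(STAGES)
# The overall run only passes when every stage exited with code 0.
exit_code = 0 if all(code == 0 for code in results.values()) else 1
print(results, "overall exit code:", exit_code)
```

Running all stages before reporting, instead of stopping at the first failure, lets a contributor see test failures and lint complaints in a single pass.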

Analysis of how mypy upholds quality standards

Wait. Quality is standard, right? Unfortunately, not always. We will analyze code quality, testing and technical debt to check how mypy upholds its quality, by looking at the mypy repository, its documentation, GitHub issues and GitHub pull requests.

Documentation

The developer guidelines, located in the Wiki section of the GitHub repository, state the code quality guidelines and coding conventions. For contributing to mypy in general these are very simple: they basically enforce flake8. The code quality requirements for contributing to typeshed are far more extensive, as described here. Besides code quality, there is a reference on how to write unit tests for implemented functionality¹. There is no reference to the concept of technical debt in the developer guidelines, which makes sense, as only the core team of developers, and not your average contributor, can identify cruft: “deficiencies in internal quality that make it harder than it would ideally be to modify and extend the system further”².

Issues

When a user wants to create a new issue, they are given guidelines on what information to specify in order to make the process go more smoothly. The core team labels the issue and readily provides feedback to contributors from the community who might want to take a shot at providing a fix. Overall, there isn’t a discussion of code quality, testing or technical debt until the user has submitted a pull request and the core team reviews the code.

Pull requests

Large contributions within pull requests are discouraged. The contributor is instead encouraged to split the contribution into several parts, so the reviewing process moves more quickly. In addition to passing the existing test suite, the contributor is encouraged to write their own tests if the contribution is substantial enough. The code quality that mypy requires is checked purely programmatically, as mentioned in the documentation section above. With regard to technical debt, it is up to a core contributor, who has insight into the code base at a larger scale, to decide whether a contribution can be accepted.

Suggestions on how mypy can be refactored

Okay, now that we know which quality standards should be upheld, let’s review some code! The tools we use for the review are Sigrid and SonarQube. We have chosen to use both tools to get a broader view of the points where mypy can be improved.

An analysis by Sigrid

We already showed you some of Sigrid’s analysis, the dependency graph, in the last blog post. Here we will only consider the refactoring suggestions for the entire code base and not an analysis per component. Sigrid calls these refactoring suggestions “violations”, and it identifies four categories, which we will briefly discuss for mypy.

Maintainability violations

Mypy has a considerable number of maintainability violations, listed in the table below.

Violation type	Instances in mypy	Brief description
Unit complexity violation	330	Code units which are excessively complex.
Unit size violation	277	Code units which are excessively large.
Unit interfacing violation	251	Use of too many parameters in calling a unit of code.
Duplication violations	155	Instances where there is duplication in the same or another file.
Component independence violations	29	The ratio of incoming, outgoing, throughput and internal calls.
Module coupling violations	20	There are several incoming calls to this module.

Severe violations

In this category Sigrid has identified 20 violations. Most of them are in build.py; for example, there is an instance where a catch block in a try-except structure is missing.

Warnings

In this category Sigrid identifies 335 instances of too many “TODO” and “FIXME” markers in the same file, which indicates poor quality due to incompleteness. Furthermore, Sigrid identified 180 instances of lines of code in comments, although on further analysis these turned out to be (mostly) false positives: they were complementary information that made the code more understandable (e.g. a type indication for a variable).
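A check of this kind is easy to approximate. The sketch below counts lines whose comment contains a TODO or FIXME marker; it is our own simplification (`count_markers` is a hypothetical helper), Sigrid’s actual rule and threshold are not known to us, and a `#` inside a string literal would be miscounted.

```python
import re
from collections import Counter

# A marker only counts when it appears after a "#" on the same line.
MARKER = re.compile(r"#.*\b(TODO|FIXME)\b")


def count_markers(source: str) -> Counter:
    """Count comment lines carrying a TODO or FIXME marker (one per line)."""
    counts: Counter = Counter()
    for line in source.splitlines():
        match = MARKER.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


SAMPLE = """\
def f(x):
    # TODO: handle None
    return x  # FIXME this is wrong for tuples
todo_list = []  # plain code, not a marker
"""

counts = count_markers(SAMPLE)
print(counts)
```

Running such a counter per file and flagging files above some limit gives a rough picture of where unfinished work accumulates.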

Code smells

In this category Sigrid has found a considerable number of violations in mypy, listed in the table below. This is a selection based on the number of instances; Sigrid discovered other types of violations as well, with a lower frequency.

Violation type	Instances in mypy	Brief description
Redefined symbol	262	Duplicate names in same scope.
Shotgun surgery violation	162	When a small new change has to be repeated in many places, which violates the ‘Don’t Repeat Yourself’ principle.
Data clumps violation	46	Data group reappearing as parameter to operations throughout the system.
Extensive coupling violation	43	A class or a module that depends a lot on other classes or modules.
Internal duplication violation	25	Significant duplication within a class or module.
External duplication violation	21	Significant duplication between classes or modules.
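To make one of these smells concrete, here is a small sketch of a data clump and a common remedy: grouping the values that always travel together into one object. All names (`Location`, `format_error_before`, `format_error_after`) are hypothetical and not taken from mypy.

```python
from dataclasses import dataclass


# Before: the same (path, line, column) trio is passed around everywhere.
def format_error_before(path: str, line: int, column: int, message: str) -> str:
    return f"{path}:{line}:{column}: {message}"


# After: the clump becomes one small, immutable value object.
@dataclass(frozen=True)
class Location:
    path: str
    line: int
    column: int


def format_error_after(location: Location, message: str) -> str:
    return f"{location.path}:{location.line}:{location.column}: {message}"


before = format_error_before("build.py", 10, 4, "example message")
after = format_error_after(Location("build.py", 10, 4), "example message")
print(before)
print(after)
```

Besides shortening parameter lists, which also helps with unit interfacing violations, the grouped object gives the clump a name that can be documented and type checked.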

An analysis by SonarQube

SonarQube Community edition provides the following code analysis results:

SonarQube detects no bugs or vulnerabilities. The test coverage metric isn’t very useful: the reported 0% coverage is most likely caused by SonarQube not being able to recognize the tests written for mypy. The most interesting metrics are Security Hotspots, Technical Debt and Code Smells. We will review each of these more extensively.

Security Hotspots

For security hotspots, SonarQube categorizes the possible security threats in the mypy code by priority.

High-priority threats:
- Command injections for instances where sys.argv is used directly
- Execution of system commands with subprocess
Medium-priority threats:
- Denial of service attacks by the use of regular expressions
Low-priority threats:
- A false positive: SonarQube analysed a link within a comment and suggested the use of https over http
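Whether any of the flagged places in mypy is actually exploitable is not something the report tells us. To illustrate why the combination of user-controlled arguments and subprocess calls is treated as sensitive, the following sketch (with the hypothetical helper `echo_via_subprocess`) shows the usual safe pattern: passing the command as a list without a shell, so that the user text arrives as a single argument and is never interpreted as shell syntax.

```python
import subprocess
import sys


def echo_via_subprocess(user_text: str) -> str:
    """Echo user text through a child process without involving a shell."""
    completed = subprocess.run(
        # List form, no shell=True: user_text is exactly one argv entry.
        [sys.executable, "-c", "import sys; print(sys.argv[1])", user_text],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


# With a shell, the part after ";" could run as a second command.
hostile = "hello; echo injected"
echoed = echo_via_subprocess(hostile)
print(echoed)
```

The child process simply prints the hostile string back unchanged, because no shell ever splits it into separate commands.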

Technical Debt

In terms of technical debt, SonarQube detected an overall debt of 0.7% and a per-file debt of less than 5% for all the source files, which qualifies mypy for a rating of “A”. This result is indicated in the chart below:

Even with an “A” rating, improvements are possible. As we will see in the next section, the technical debt can be reduced even further by fixing issues related to code smells.

Code Smells

Code smells aren’t bugs in the code, but technical design decisions that might contribute to a system’s technical debt.³ A total of 800 code smells were detected by SonarQube, categorized by severity. Below is a list of the severity classes with examples of the associated issues:

Blocker code smells: high impact and high likelihood. [0 issues]
Critical code smells: high impact but low likelihood. [462 issues]
- Refactoring functions to reduce the cognitive complexity from x to y.
- Defining a constant instead of duplicating a literal, for example the string literal builtins.str, which appears in checker.py.
Major code smells: low impact but high likelihood. [202 issues]
- Merging if statements with enclosing ones.
- Removing or filling a block of code.
  - For instance in build.py line 2388.
Minor code smells: low impact and low likelihood. [136 issues]
- Various renaming suggestions to match an expected regex.
- Removing redundant continue statements.
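Three of the suggestions listed above (a duplicated literal, a nested if and a redundant continue) can be shown in one small before-and-after sketch. The functions and the constant name `STR_FULLNAME` are made up for illustration and are not taken from checker.py or build.py.

```python
# Before: the literal is duplicated, the ifs are nested, and the loop
# ends with a continue that has no effect.
def is_str_before(name):
    return name == "builtins.str"


def count_str_before(names):
    total = 0
    for name in names:
        if name is not None:
            if name == "builtins.str":
                total += 1
                continue  # redundant: nothing follows in the loop body
    return total


# After: one named constant, one merged condition, no continue.
STR_FULLNAME = "builtins.str"


def is_str_after(name):
    return name == STR_FULLNAME


def count_str_after(names):
    total = 0
    for name in names:
        if name is not None and is_str_after(name):
            total += 1
    return total


sample = ["builtins.str", None, "builtins.int", "builtins.str"]
print(count_str_before(sample), count_str_after(sample))
```

Both versions behave identically, which is the point of this class of findings: they change how easy the code is to read and modify, not what it does.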

To conclude the story, we have investigated the mypy code base in search of places to improve it. Of course there are many more areas of mypy we could have explored in more depth, but for now we will pause our writing. Fear not, we will return with our last blog post about the variability of mypy.

¹ mypy documentation on testing. https://github.com/python/mypy/tree/master/test-data/unit
² Martin Fowler. TechnicalDebt. 2019. https://www.martinfowler.com/bliki/TechnicalDebt.html
³ Tufano, Michele, et al. “When and why your code starts to smell bad.” 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering. Vol. 1. IEEE, 2015.