Plotting Bokeh, an Analysis of its Collaboration

Product development involves two fundamental elements: a technical one and a social one.[1] A project can only succeed when both components are harmoniously aligned. In previous essays we analysed both elements. We focused especially on Bokeh’s technical side, studying its properties, processes, development, deployment, architecture, code quality and testing.[2][3][4] Nevertheless, we also shed some light on the social component, describing the importance of the different groups of individuals involved in the project.

In this last essay we will dive deeper into its social element, that is, how Bokeh as an organization handles collaboration and communication. Ultimately, this study will reflect on how and when the different individuals involved in the development process interact, and how those interactions can be used to enhance development productivity.

Why should we trust in communication?

The relationships between the software developers are as important as the relationships between the structure and the components of the software, since coordinating product design decisions requires good communication among the engineers who make and implement them.[1]

The core team is well aware of its importance, since the discussion on how communication should be handled has been a recurring matter. Core team member Mateusz Paprocki raises an important point in an issue comment.

Mateusz argues that discussions on GitHub bring transparency for the whole community, and that their role should not be undervalued. The software development literature is not indifferent to this issue either, as stated by Cataldo et al.:[1]

The product development literature argues that information hiding, which leads to minimal communication between teams, causes variability in the evolution of projects, frequently resulting in integration problems.

Moreover, Mateusz expresses the importance of preserving valuable information such as discussions on design and architectural decisions in the fewest possible number of channels and tools. In fact, he acknowledges the importance of using a collaborative tool such as GitHub as a centralized communication device in order to preserve vital information, so nothing gets lost and everything can be widely and easily accessed.

Nevertheless, Bryan van de Ven, another member of the core team, has voiced on several occasions[5][6] the importance of interpersonal interactions and overall communication as a way to inform and learn from each other.

Additionally, in a recent interview we conducted, Bryan stated:

[…] once a project reaches a certainly level of size/maturity, all the important and hard problems are not technical ones, they are people/community ones.

We are aware that implementing good collaboration and communication practices is not an easy task for any organization. For an open source project it is even more challenging, as a large portion of the contributors is usually geographically distributed.

Ultimately, we argue that both types of communication discussed previously should not be underplayed and are equally important. However, we raise our concerns regarding the nature of that communication, and encourage transparency as a way to get users involved in the development process.

Collaboration analysis: the method

In this essay we will explore the communication and coordination patterns shown by Bokeh’s developers, in order to assess development productivity and thus predict how the resolution time of modification requests can be reduced. Moreover, as Conway’s law reminds us, by studying the communication structure of the organization, one can make relevant inferences about the design of the structure of the final product.[7]

We will base our analysis on Pull Requests (PRs) and Issues on GitHub, as GitHub is the central tool in Bokeh’s development. We know, however, that development-related communication also takes place on Zulip, Discourse and private channels, as explained in our previous essay.

To carry out our analysis we resort to the Congruence Framework introduced by Cataldo et al., based on the idea that in software development there are two components, the technical and the social one, which need to be aligned to have a successful project.[1]

Socio-technical congruence can then be defined as the match between the coordination requirements established by the dependencies among tasks and the actual coordination activities carried out by engineers.[1]

This highlights two important elements that together give rise to congruence: the set describing which individuals are working on which tasks, and the set describing the dependencies among tasks. Both can be modelled as matrices that, once combined, yield the coordination requirements matrix, which represents the extent to which each pair of developers needs to coordinate their work. With this matrix, it is then possible to compute congruence, defined as the proportion of coordination activities that actually occurred relative to the total number of coordination activities that should have taken place.

Nonetheless, we are aware that the level of dynamism in an open source project, such as Bokeh, is different from the one studied in the paper. This inevitably forces adjustments to the framework explained above.

First of all, instead of analyzing which individuals are working on which tasks, we investigate which modules were changed by which developers. This is a necessary workaround, as Bokeh doesn’t resort to task assignment policies. Furthermore, analyzing module changes instead of file changes allows us to build on our previous work, in which we already commented on Bokeh’s module structure.

Secondly, when it comes to the study of relationships or dependencies among tasks, we apply two different methods:

  1. We use the module coupling analysis provided by Sigrid and introduced in our previous essays;

  2. We examine the set of modules that were changed together in a specific Pull Request, which we call the Modules Changed Together (MCT) method, inspired by Cataldo’s Files Changed Together (FCT) method.[1]

The reason for complementing the analysis of syntactic dependencies (in this case function calls between different modules) with the MCT approach is outlined in the paper. Cataldo et al. argue that an analysis of functional dependencies captures only a narrow view of the relationships among tasks. The motivation for MCT is also rather straightforward:

when a modification request requires changes to more than one file, it can be assumed that decisions about the change to one file in a modification request depend in some way on the decisions made about changes to the other files involved in implementing the modification request.
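As an illustration of the MCT idea, the following is a minimal sketch, not the code of our tool. It assumes each PR has already been reduced to the set of modules it touched; the function name `modules_changed_together` and the module names in the example are hypothetical. The diagonal is left at zero here, which is a simplification of our own.

```python
from itertools import combinations

def modules_changed_together(prs):
    """Build a symmetric module-by-module co-change matrix.

    `prs` is an iterable of collections of module names, one per PR
    (an assumed input shape, chosen for this sketch).
    """
    modules = sorted({m for pr in prs for m in pr})
    index = {m: i for i, m in enumerate(modules)}
    td = [[0] * len(modules) for _ in modules]
    for pr in prs:
        # Every unordered pair of modules in the same PR counts once.
        for a, b in combinations(sorted(set(pr)), 2):
            td[index[a]][index[b]] += 1
            td[index[b]][index[a]] += 1
    return modules, td

# Hypothetical PRs, each reduced to the modules it changed.
example_prs = [
    {"bokeh/core", "bokeh/models"},
    {"bokeh/core", "bokeh/models", "bokehjs/models"},
    {"bokeh/util"},
]
modules, td = modules_changed_together(example_prs)
```

A PR that touches a single module contributes nothing to the matrix, since it provides no evidence of a dependency between modules.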

Finally, to assess the number of coordination activities carried out by developers we analyse comments and reviews on Pull Requests and subsequent Issues referenced on the Pull Request. From now on we will refer to this combination as a Task. We define that a developer communicated with another developer if both commented on the same Task, regardless of whether they actually interacted directly, since we assume that developers read all comments on Tasks they also comment on.
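The definition above can be sketched as follows. This is a simplified illustration under stated assumptions: comments and reviews are assumed to be available as `(task_id, author)` pairs, and the function name `actual_coordination` as well as the developer names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def actual_coordination(comments):
    """Count, per developer pair, the Tasks on which both commented.

    `comments` is an iterable of (task_id, author) tuples; this input
    shape is an assumption made for the sketch.
    """
    participants = defaultdict(set)
    for task_id, author in comments:
        participants[task_id].add(author)

    pairs = Counter()
    for authors in participants.values():
        # Two developers "communicated" if both commented on the Task,
        # regardless of whether they replied to each other directly.
        for a, b in combinations(sorted(authors), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical comment log.
example_comments = [
    (101, "alice"), (101, "bob"), (101, "alice"),
    (102, "bob"), (102, "carol"), (102, "alice"),
    (103, "carol"),
]
ca_pairs = actual_coordination(example_comments)
```

Repeated comments by the same developer on one Task count only once, because participation is tracked as a set.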

All the data used in this study was collected using the GitHub API. We analysed every PR opened between the release of v1.0.0 (October 2018) and the release of v2.0.0 (March 2020), together with all the issues related to each of those PRs. The analysed timespan is roughly 18 months of development. We supplement this study with plots, graphs and other visuals made with Bokeh; it is appropriate to use Bokeh to analyse Bokeh, right?

To retrieve all of the data mentioned above, we took the initiative and engineered a tool. We built a pipeline that fetches the necessary data using the GitHub API and processes it, discarding what is not needed and computing what is relevant. The output of these processes is then directed to the front-end and displayed using figures created with Bokeh.

We will also make the source code available so that these experiments can be reproduced. Interested readers can apply it to their favorite GitHub projects. Note that it is open source, so we are more than happy to discuss improvements.

Collaboration analysis: the results

We start with an overall assessment of the data collected throughout the 18 months of development. During this period, we identified 700 PRs opened by 154 different collaborators. Figure 1 combines these numbers, showing the number of PRs opened by each developer with more than 2 PRs.

Figure 1. PRs by developer

In addition, we identified a total of 4876 comments on Tasks, and broke this number down by developer.

We also collected data on how many reviews each developer did (Figure 2), out of a total of 549 reviews.

Figure 2. Reviews by Developer

Finally, we investigated how many times each of the 54 considered modules was changed (Figure 3).

Figure 3. PRs by Module

Thanks to this dataset, we were able to build the matrices according to the procedure explained in the previous section, as well as the coordination requirements matrix. This matrix, compared against the actual coordination activities matrix, provides a measure of congruence.

We start by analyzing a developer by modules matrix (Figures 4 and 5). Each cell ij indicates the number of changes that individual i made to module j. We refer to this matrix as Ta.
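A matrix of this kind could be assembled roughly as follows. This is a sketch under assumptions: each PR is represented as an `(author, modules)` pair, and a "change" is counted as one PR touching the module. The function name `developer_by_module` and the example data are hypothetical.

```python
from collections import defaultdict

def developer_by_module(prs):
    """Return (developers, modules, matrix) for a developer-by-module count.

    `prs` is an iterable of (author, modules) pairs. Counting one change
    per PR and module is an assumption of this sketch.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for author, modules in prs:
        for module in set(modules):
            counts[author][module] += 1

    developers = sorted(counts)
    modules = sorted({m for row in counts.values() for m in row})
    # Row i is developer i, column j is module j.
    matrix = [[counts[d].get(m, 0) for m in modules] for d in developers]
    return developers, modules, matrix

# Hypothetical PRs.
example_prs = [
    ("alice", ["bokeh/core", "bokeh/models"]),
    ("bob", ["bokehjs/models"]),
    ("alice", ["bokeh/core"]),
]
developers, modules, ta = developer_by_module(example_prs)
```

Dividing each row by its sum would give a percentage view of the same data, similar in spirit to the percentage matrix.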

Figure 4. Developer by Modules Matrix (percentage)

Figure 5. Developer by Modules Matrix (absolute)

At first glance, we can see that a small set of developers is responsible for a large portion of the changes that took place during the 18 months.

Together, core team members Mateusz and Bryan touched every module. Mateusz contributed more to BokehJS (JavaScript), while Bryan contributed more to Bokeh (Python). They are the main contributors to the project and its principal pillars.

Besides the Bokeh module, the Bokeh/models module was changed by many different contributors, which makes sense since Models represent the building block classes.[3]

Next, we compute the matrix that describes which modules are changed together according to the inspected PRs (Figure 6). We call it Td. Each cell ij (or cell ji) in Td indicates the number of times that modules i and j were changed together.

Figure 6. Modules by Modules Matrix

It is possible to see that, for instance, the module pairs Bokeh/util and Bokeh/embed, and Bokeh and Bokeh/core, are usually changed together. In fact, each pair is also tightly coupled according to the syntactic analysis provided by Sigrid. It is also interesting that the latter pair, besides being usually changed together, shares cyclic dependencies.

On the other hand, the modules Bokeh and Bokeh/_testing are not tightly coupled when it comes to Sigrid’s analysis, but are usually modified in the same Pull Request.

Now that we have matrices Ta and Td we can compute the coordination requirements matrix, Cr (Figure 7). We have that Cr = Ta × Td × Taᵀ, with Taᵀ being the transpose of Ta.[1]
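To make the product concrete, here is a small worked sketch in plain Python. The helper names and the toy values (three developers, two modules) are invented for illustration, including the diagonal of Td, and do not come from our dataset.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def coordination_requirements(ta, td):
    """Cr = Ta x Td x Ta^T (developers x developers)."""
    return matmul(matmul(ta, td), transpose(ta))

# Toy Ta: 3 developers (rows) by 2 modules (columns).
ta = [[2, 0],
      [0, 1],
      [1, 1]]
# Toy Td: 2 modules by 2 modules, symmetric; values are made up.
td = [[3, 1],
      [1, 2]]

cr = coordination_requirements(ta, td)
# cr == [[12, 2, 8],
#        [2, 2, 3],
#        [8, 3, 7]]
```

Because Td is symmetric, the result is symmetric as well: the need for developer i to coordinate with developer j equals the need for j to coordinate with i.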

Figure 7. Matrix Cr

Finally, we show the matrix that specifies the coordination activities that actually occurred, Ca (Figure 8).

Figure 8. Matrix Ca

Cr tells us how much each pair of people needs to collaborate in order to coordinate their work. The bigger the value in a cell, the more we expect the two developers to interact with each other. We compare it against the actual number of coordination activities carried out by the developers, Ca. As previously discussed, identifying which individuals should be coordinating their activities is an important step in enhancing productivity and quality in the development process. When collaboration is not properly evaluated, several problems can arise:

Information hiding led development teams to be unaware of others teams’ work resulting in coordination problems.[1]

Analyzing Cr, one can see that Mateusz and Bryan are expected to coordinate their activities closely, as they are the developers who contribute the most. They are also expected to be the bridge between Bokeh’s development and new contributors. Unsurprisingly, this is actually the case, as can be seen in Ca and Figure 9. Bryan and Mateusz are the central pieces of this network.

Figure 9. Communication network in Bokeh. Black nodes represent Mateusz and Bryan. Green nodes represent other members of the core team. The blue node represents one of the authors of this essay.

Congruence could then be computed following the formula presented in the paper,[1] which would give us a value between 0 and 1 representing the proportion of requirements that were satisfied.
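As we read the definition, congruence is the number of developer pairs that both needed to coordinate (a positive entry in Cr) and actually did (a positive entry in Ca), divided by the number of pairs that needed to coordinate. The sketch below follows that reading. Skipping the diagonal and returning `None` when no coordination is required are assumptions of ours, and the example matrices are made up.

```python
def congruence(cr, ca):
    """Proportion of coordination requirements that were satisfied.

    `cr` and `ca` are square developer-by-developer matrices with the
    same developer order. Diagonal cells are ignored (our assumption).
    """
    n = len(cr)
    required = [(i, j) for i in range(n) for j in range(n)
                if i != j and cr[i][j] > 0]
    if not required:
        return None  # nothing to coordinate, so the ratio is undefined
    satisfied = sum(1 for i, j in required if ca[i][j] > 0)
    return satisfied / len(required)

# Made-up matrices for three developers.
cr = [[12, 2, 8],
      [2, 2, 3],
      [8, 3, 7]]
ca = [[0, 4, 0],
      [4, 0, 1],
      [0, 1, 0]]

score = congruence(cr, ca)  # 4 of the 6 required ordered pairs are met
```

In this toy case the first and third developers were expected to coordinate but never commented on the same Task, which is what lowers the score below 1.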

Bokeh and BokehJS, an analysis

In the roadmap for the future[2] we stated that a main goal for Bokeh’s core team is to develop BokehJS as a first-class JavaScript library. This was also mentioned by Bryan in the interview:

I’d also like to continue to encourage BokehJS as a pure JS tool that can be used on its own. My hope is this might attract new JS contributors to the project.

The question that arises is: “Are Bokeh and BokehJS able to evolve separately?”. Our analysis showed that changes in BokehJS are usually accompanied by changes in modules on the Python side. Comments on some issues are also proof of that. This suggests that, for BokehJS to evolve as a separate tool, coordination in communication and collaboration will need to play an extremely important role in the future of Bokeh.

Shortcomings and conclusion

We identify some shortcomings with our method.

First, we deliberately decided not to analyse all means of communication. However, doing that could have given us different and relevant new insights.

Secondly, we didn’t use label information to filter Pull Requests. This could have given us the ability to only focus on feature additions, which would be in itself interesting.

Finally, we did not take into consideration that usually the work in an open source project is not evenly distributed among all contributors.

Nevertheless, this essay allows us to conclude that communication is a fundamental element of software development and should not be undervalued. Good coordination and collaboration practices are also mandatory, since communication overhead can be detrimental for projects of considerable size.






Alfonso Irarrázaval
Andrea Monguzzi
Guilherme Fonseca
Miguel Cardoso