Debian, licenses, and license-count

A few people have been using the license-count tool that I wrote to analyze the breakdown of licenses in Debian and draw some conclusions. Unfortunately, some of those conclusions can't be drawn from the output of this specific tool (which doesn't make them wrong, just unproven). So here are a few cautions.

license-count was written for a specific purpose: to see how many references there are to a particular license in Debian so that we had some facts behind our discussion of whether to include it in common-licenses. It was not written to classify packages into particular categories, only to find out how many copies of a license we could save by adding it to common-licenses. This means that using it to determine how much software in Debian is covered by a particular license fails in the following ways:

  1. license-count numbers cannot be added. Since the goal is to find every reference to a given license, a package is not classified as being under exactly one license. If a package has material under the GPL, the LGPL, and the GFDL, it will add one to each of those counts. This means that you cannot use its output to answer questions like "how many packages are covered under one of the GPL family of licenses?" You will double- or treble-count packages when you add the numbers together.

  2. license-count counts any reference, not just "important" ones. If the license is referenced in debian/copyright, that counts, since the goal is to see how many packages might benefit from being able to reference it in common-licenses. The referenced license may not be the one that the package as a whole is under; it may cover only some small portion of the package.

    For example, all of my packages that use Autotools add to the GPL count because I include all the source code licensing statements, including the statements for libtool and the Automake helper scripts, which say that the files can be released under either the GPL or the license of the including package. These packages are generally under a BSD-style license (or in some cases the Apache 2.0 license), not under the GPL, but since the reproduced license text has a reference to the GPL, license-count adds one to the GPL column for each of those binary packages. For most purposes of counting GPL-covered software, these are false positives. I don't know how many people are as complete in documenting license statements as I am, but just the binary packages that I maintain probably account for at least 50 false positives because of this alone.

  3. license-count numbers include dual-licensing. Whether this matters for your analysis or not depends on what question you're asking, of course. If you're asking whether the package is covered under a copyleft license, including the thousands of packages that are dual-licensed under the Artistic license and the GPL may be okay, since the Artistic license is a sort of copyleft (although it allows you to avoid the copyleft requirement by renaming executables, providing the originals, and documenting your changes). But, for example, one of my packages is dual-licensed under a BSD-style license and the GPL (for reasons that aren't worth getting into). This package is not covered by copyleft under any reasonable definition of the term, but license-count will count it as a package using the GPL.
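
To make the double-counting problem concrete, here is a toy model in Python. It is not the license-count code, only a sketch of the same kind of per-reference tally. The package names, the license sets, and the grouping of GPL, LGPL, and GFDL into one "family" are all made up for illustration.

```python
from collections import Counter

# Hypothetical data: package name -> licenses referenced in debian/copyright.
references = {
    "pkg-a": {"GPL", "LGPL", "GFDL"},  # material under three licenses
    "pkg-b": {"GPL", "Artistic"},      # dual-licensed
    "pkg-c": {"BSD", "GPL"},           # BSD-style, with a GPL reference
    "pkg-d": {"Apache-2.0"},
}


def count_references(references):
    """Add one to a license's count for every package that references it."""
    counts = Counter()
    for licenses in references.values():
        counts.update(licenses)
    return counts


def packages_referencing_any(references, family):
    """Count each package at most once if it references any license in family."""
    return sum(1 for licenses in references.values() if licenses & family)


counts = count_references(references)
family = {"GPL", "LGPL", "GFDL"}  # illustrative grouping only

summed = sum(counts[name] for name in family)
distinct = packages_referencing_any(references, family)

print("sum of per-license counts:", summed)
print("distinct packages with a reference:", distinct)
```

With this sample data, adding the per-license counts gives 5, although only 3 packages reference any of those licenses. Even the smaller number only says that a reference exists. It still includes pkg-b and pkg-c, which illustrate the dual-licensing and false-positive cases described above.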

In order to draw interesting conclusions like "how much software in Debian is protected by copyleft," I'm afraid that you'd have to write a much more sophisticated tool than license-count.

Posted: 2012-02-19 09:30

Last spun 2013-07-01 from thread modified 2013-01-04