NuGet package statistics

For a while, I’ve been considering how useful nuget.org statistics are.

I know there have been issues in the past around accuracy, but that’s not what I’m thinking about. I’ve been
trying to work out what the numbers mean at all and whether that’s useful.

I’m pretty sure an older version of the nuget.org gallery gave stats on a per-operation basis, but right now it looks like we can break the downloads down by package version, client name and client version. (NodaTime example)

In a way, the lack of NuGet “operation” at least makes it simpler to talk about: we only know about “downloads”. So, what counts as a download?

What’s a download?

Here are a few things that might increment that counter:

  • Manual download from the web page
  • Adding a new package in Visual Studio
  • Adding a new package in Visual Studio Code
  • nuget install from the command line
  • dotnet restore for a project locally
  • dotnet restore in a Continuous Integration system testing a PR
  • dotnet restore in a CI system testing a merged PR

All of them sound plausible, but it’s also possible that they wouldn’t increment the counter:

  • I might have a package in my NuGet cache locally
  • A CI system might have its own global package cache
  • A CI system might use a mirror service somehow
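
To make that concrete, here’s a minimal sketch (with invented package details; this is not NuGet’s actual code) of how a client-side cache swallows would-be downloads: ten restores of the same package on one machine produce exactly one counted download.

    using System;
    using System.Collections.Generic;

    class RestoreSketch
    {
        // Stand-in for the global packages folder (~/.nuget/packages),
        // keyed by "PackageId/Version". Purely illustrative.
        static readonly HashSet<string> localCache = new HashSet<string>();

        static void Restore(string packageId, string version)
        {
            string key = $"{packageId}/{version}";
            if (localCache.Contains(key))
            {
                // Cache hit: nothing touches nuget.org, so no download is counted.
                return;
            }
            // Cache miss: the only path that would register as a "download".
            Console.WriteLine($"Downloading {key} from the feed");
            localCache.Add(key);
        }

        static void Main()
        {
            // Ten restores, e.g. ten CI builds on one agent: one counted download.
            for (int i = 0; i < 10; i++)
            {
                Restore("NodaTime", "2.0.0");
            }
        }
    }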

So what does the number really mean? Some accident of developer behavior, tooling configuration and project lifetime? One natural reaction to this is “The precise meaning of the number doesn’t matter, but bigger is better.” I’d suggest that’s overly complacent.

Suppose I’m right that some CI systems have a package cache, but others don’t. Suppose we look at packages X and Y, which have download numbers of 1,000 and 100,000 respectively. (Let’s ignore
which versions those are for, or how long those versions have been out.) Does that mean Y’s usage is “better” than X’s in some way? Not necessarily. Maybe it means there’s a single actively-developed
open source project using Y, with a CI system that has no NuGet cache (and is configured to build every revision of every PR), whereas maybe there are a thousand entirely separate projects using
X, all on a CI system that serves a single cached copy for everything.

Of course, that’s an extreme position. It’s reasonable to suggest that on average, if package Y has larger download numbers than package X, then it’s likely to be more widely used… but can we
do better?

What are we trying to measure?

Imagine we had perfect information: a view into every machine on the planet, and every operation any of them performed. What number would we want to report? What does it mean for a package to be “popular” or “widely used”?

Maybe we should think in terms of “number of projects that use package X”. Let’s consider some situations:

  • A project created to investigate a problem, then deleted; never even committed to a source control system.
  • A project which is created and committed to source control, but never used.
  • A project created and in production use, maintained by 1 person.
  • A project created and in production use, maintained by a team of 100 people.
  • A project created by 1 person, but then forked by 10 people and never merged.
  • A project created on GitHub by 1 person, and forked by 10 people on GitHub, with them repeatedly creating branches and merging back into the original repo.
  • A project which doesn’t use package X directly, but uses package Y, which depends on package X.

If those all happened for the same package, what number would you want each of those projects to contribute to the package usage?

One first-order approximation could be achieved with “take some hash of the name of the project and propagate it (even past caches) when installing a package”. That would allow us to be reasonably confident in some measure of “how many differently-named projects depend on package X” which might at least feel slightly more reasonable, although it’s unclear to me how throwaway projects would end up being represented. (Do people tend to use the same names as each other for throwaway projects? I bet Console1 and WindowsForms1 would be pretty popular…)

That isn’t a serious suggestion, by the way – it’s not clear to me that hashing alone provides sufficient privacy protection, for a start. There are multiple further issues in terms of cache-busting, too. It’s an interesting thought experiment.
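
Still, purely to illustrate the thought experiment, here’s a toy version with an entirely made-up telemetry shape (nothing here resembles real NuGet infrastructure): the client hashes the project name, and the server counts distinct hashes per package. It also shows how throwaway names would distort the count: every “Console1” in the world collapses into a single entry.

    using System;
    using System.Collections.Generic;
    using System.Security.Cryptography;
    using System.Text;

    class ProjectCountSketch
    {
        // Client side: hash the project name so the raw name never leaves
        // the machine. (As noted above, a bare hash like this is probably
        // not enough for real privacy.)
        static string HashProjectName(string projectName)
        {
            using (var sha = SHA256.Create())
            {
                byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(projectName));
                return BitConverter.ToString(digest).Replace("-", "");
            }
        }

        // Server side: distinct project-name hashes per package, a rough
        // stand-in for "how many differently-named projects depend on X".
        static readonly Dictionary<string, HashSet<string>> projectsByPackage =
            new Dictionary<string, HashSet<string>>();

        static void RecordInstall(string packageId, string projectHash)
        {
            if (!projectsByPackage.TryGetValue(packageId, out var hashes))
            {
                hashes = new HashSet<string>();
                projectsByPackage[packageId] = hashes;
            }
            hashes.Add(projectHash);
        }

        static void Main()
        {
            RecordInstall("NodaTime", HashProjectName("MyCompany.Billing"));
            RecordInstall("NodaTime", HashProjectName("MyCompany.Billing")); // same project: counted once
            RecordInstall("NodaTime", HashProjectName("Console1"));          // every "Console1" collides
            Console.WriteLine(projectsByPackage["NodaTime"].Count);          // 2
        }
    }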

What do I actually care about though?

That’s even assuming that “number of projects that use package X” is a useful measure. It’s not clear to me that it is.

As an open source contributor, there are two aspects I care about:

  • How many people will I upset, and how badly, if I break something?
  • How many people will I delight, and to what extent, if I implement a particular new feature?

It’s not clear to me that any number is going to answer those questions for me.

So what do you care about? What would you want nuget.org to show if it could? What do you think would be reasonable for it to show in the real world with real world constraints?

6 thoughts on “NuGet package statistics”

  1. What if, and this is a deviation from gathering statistics, you provided a way for package users to let you know their level of dependence on the package? Kind of a “like” button, but say… a “depend” button, which marks that a user’s product depends on a particular package. The end goal is to gauge the magnitude of the package’s use with some degree of accuracy. So what if you let people explicitly say that they like something, rate how much they like it, or describe what their use entails?

  2. That is such a great question! I think that, if it were possible, a good measure would be work hours: that is, how many hours the package was in use anywhere. This solves the one-off project issue you mentioned, as well as the problems with a raw download measure. It would also be possible to divide the value by the number of hours the package has been available, which would allow you to compare the popularity of new packages vs. veteran ones, as well as of different versions.

    The only question is, how do you report on that? Especially in cases where the package runs on machines with no connectivity out to the world.
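
    A minimal sketch of that normalization (both inputs invented; collecting the usage hours is exactly the hard part):

        using System;

        static class PopularitySketch
        {
            // The "work hours" measure from the comment above, normalized by
            // how long the package has been available, so that a new package
            // can be compared fairly against a veteran one.
            public static double NormalizedUsage(double totalUsageHours, DateTime firstPublished)
            {
                double hoursAvailable = (DateTime.UtcNow - firstPublished).TotalHours;
                return totalUsageHours / hoursAvailable;
            }
        }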

  3. Sounds to me like a good candidate for Google’s PageRank: we’re looking for important packages. The importance of a package can be determined as a function of two parameters:

    The first parameter is calculated non-recursively: some direct measurement of how important the package is on its own.

    The second parameter is recursive: if many important packages depend on this package, it is indeed an important one.

    This also correlates well with a package’s trustworthiness: if many important packages depend on it, it should be pretty safe to use.

    Of course, I’ve not speculated enough, and many questions can be asked: should the recursion reference the total importance, or just the recursive parameter? How should we make it stable? And so on.
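
    As a rough sketch of that idea (graph, package names and damping factor all invented), a PageRank-style iteration over a dependency graph, with importance flowing from each dependent to the packages it depends on:

        using System;
        using System.Collections.Generic;
        using System.Linq;

        class PackageRankSketch
        {
            static void Main()
            {
                // Invented graph: each key depends on the packages in its list.
                var dependsOn = new Dictionary<string, string[]>
                {
                    ["AppA"] = new[] { "JsonLib", "NodaTime" },
                    ["AppB"] = new[] { "NodaTime" },
                    ["JsonLib"] = new[] { "NodaTime" },
                    ["NodaTime"] = new string[0],
                };

                const double damping = 0.85;
                List<string> packages = dependsOn.Keys.ToList();
                var rank = packages.ToDictionary(p => p, p => 1.0 / packages.Count);

                // Fixed iteration count for simplicity; real code would run until
                // the ranks converge. Rank reaching a package with no dependencies
                // is absorbed here rather than redistributed as full PageRank does.
                for (int i = 0; i < 50; i++)
                {
                    var next = packages.ToDictionary(p => p, p => (1 - damping) / packages.Count);
                    foreach (var entry in dependsOn)
                    {
                        foreach (string dep in entry.Value)
                        {
                            // Each dependent shares its rank equally among its dependencies.
                            next[dep] += damping * rank[entry.Key] / entry.Value.Length;
                        }
                    }
                    rank = next;
                }

                foreach (var kv in rank.OrderByDescending(kv => kv.Value))
                {
                    Console.WriteLine($"{kv.Key}: {kv.Value:F3}");
                }
            }
        }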

  4. This makes me think of the site https://stackshare.io/. It’s certainly not a complete breakdown, but I’ve used it to find different tools for the same job and to see how popular they are. The problem is that the results are far from complete, as much of the data is only user-reported.
