NuGet package statistics

For a while, I’ve been considering how useful nuget.org statistics are.

I know there have been issues in the past around accuracy, but that’s not what I’m thinking about. I’ve been
trying to work out what the numbers mean at all and whether that’s useful.

I’m pretty sure an older version of the nuget.org gallery gave stats on a per-operation basis, but right now it looks like we can break the downloads down by package version, client name and client version. (NodaTime example)

In a way, the lack of NuGet “operation” at least makes it simpler to talk about: we only know about “downloads”. So, what counts as a download?

What’s a download?

Here are a few things that might increment that counter:

  • Manual download from the web page
  • Adding a new package in Visual Studio
  • Adding a new package in Visual Studio Code
  • nuget install from the command line
  • dotnet restore for a project locally
  • dotnet restore in a Continuous Integration system testing a PR
  • dotnet restore in a CI system testing a merged PR

All of them sound plausible, but it’s also possible that they wouldn’t increment the counter:

  • I might have a package in my NuGet cache locally (see the sketch after this list)
  • A CI system might have its own global package cache
  • A CI system might use a mirror service somehow
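
To make that first caching point concrete, here’s a minimal sketch (in C#, purely for illustration) of why a restore may never reach nuget.org at all. The path is the documented default location of the global packages folder; the NodaTime version number is just an example, and real clients obviously have more sophisticated resolution logic than this.

```csharp
// A sketch, not real client logic: if a package version is already in the
// global packages folder (~/.nuget/packages/<id>/<version>, lower-cased id),
// a restore can be satisfied locally without touching nuget.org at all.
using System;
using System.IO;

class CacheCheck
{
    static void Main()
    {
        string home = Environment.GetFolderPath(Environment.SpecialFolder.UserProfile);
        // "3.1.9" is just an example version for illustration.
        string packageDir = Path.Combine(home, ".nuget", "packages", "nodatime", "3.1.9");

        Console.WriteLine(Directory.Exists(packageDir)
            ? "Already cached: a restore wouldn't increment the download count."
            : "Not cached: a restore would download it (and presumably count).");
    }
}
```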

So what does the number really mean? Some set of coincidences in terms of developer behavior and project lifetime? One natural reaction to this is “The precise meaning of the number doesn’t matter, but bigger is better.” I’d suggest that’s overly complacent.

Suppose I’m right that some CI systems have a package cache, but others don’t. Suppose we look at packages X and Y, which have download numbers of 1,000 and 100,000 respectively. (Let’s ignore which versions those are for, or how long those versions have been out.) Does that mean Y’s usage is “better” than X’s in some way? Not necessarily. Maybe it means there’s a single actively-developed open source project using Y and a CI system that doesn’t have a NuGet cache (and is configured to build each PR on each revision), whereas maybe there are a thousand entirely separate projects using X, all using a CI system that just serves up a single cached copy for everything.

Of course, that’s an extreme position. It’s reasonable to suggest that on average, if package Y has larger download numbers than package X, then it’s likely to be more widely used… but can we
do better?

What are we trying to measure?

Imagine we had perfect information: a view into every machine on the planet, and every operation any of them performed. What number would we want to report? What does it mean for a package to be “popular” or “widely used”?

Maybe we should think in terms of “number of projects that use package X”. Let’s consider some situations:

  • A project created to investigate a problem, and then deleted; never even committed to a source control system.
  • A project which is created and committed to source control, but never used.
  • A project created and in production use, maintained by 1 person.
  • A project created and in production use, maintained by a team of 100 people.
  • A project created by 1 person, but then forked by 10 people and never merged.
  • A project created on GitHub by 1 person, and forked by 10 people on GitHub, with them repeatedly creating branches and merging back into the original repo.
  • A project which doesn’t use package X directly, but uses package Y that depends on package X.

If those all happened for the same package, what number would you want each of those projects to contribute to the package usage?

One first-order approximation could be achieved with “take some hash of the name of the project and propagate it (even past caches) when installing a package”. That would allow us to be reasonably confident in some measure of “how many differently-named projects depend on package X” which might at least feel slightly more reasonable, although it’s unclear to me how throwaway projects would end up being represented. (Do people tend to use the same names as each other for throwaway projects? I bet Console1 and WindowsForms1 would be pretty popular…)
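
Purely to make that thought experiment concrete, here’s a minimal sketch assuming SHA-256 over a normalized project name. The hashing scheme is entirely my own invention for illustration; the real NuGet protocol defines nothing like it.

```csharp
// Thought experiment only: derive an anonymized identifier from a project
// name, which clients could propagate (even past caches) when installing a
// package. SHA-256 and the normalization are arbitrary choices of mine.
using System;
using System.Security.Cryptography;
using System.Text;

class ProjectHashDemo
{
    static string ProjectHash(string projectName) =>
        Convert.ToHexString(
            SHA256.HashData(Encoding.UTF8.GetBytes(projectName.ToLowerInvariant())));

    static void Main()
    {
        // Every throwaway project called "Console1" hashes identically,
        // so distinct throwaway projects would be conflated into one.
        Console.WriteLine(ProjectHash("Console1"));
        Console.WriteLine(ProjectHash("NodaTime.Demo"));
    }
}
```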

That isn’t a serious suggestion, by the way – it’s not clear to me that hashing alone provides sufficient privacy protection, for a start. There are multiple further issues in terms of cache-busting, too. It’s an interesting thought experiment.

What do I actually care about though?

That’s even assuming that “number of projects that use package X” is a useful measure. It’s not clear to me that it is.

As an open source contributor, there are two aspects I care about:

  • How many people will I upset, and how badly, if I break something?
  • How many people will I delight, and to what extent, if I implement a particular new feature?

It’s not clear to me that any number is going to answer those questions for me.

So what do you care about? What would you want nuget.org to show if it could? What do you think would be reasonable for it to show in the real world with real world constraints?

11 thoughts on “NuGet package statistics”

  1. What if, and this is a deviation from gathering statistics, you provide a way for package users to let you know their level of dependence on the package: kind of a “like” button, but say… a “depend” button which marks that a user’s product depends on a particular package. The end goal is to gauge the magnitude of the package’s use with some degree of accuracy (or at least to within an order of magnitude). So what if you let people explicitly say that they like something, rate how much they like it, or say what their use entails?


  2. That is such a great question! I think that if it were possible, a good measure would be work hours: that is, how many hours the package was in use anywhere. This solves the issues with one-off projects that you mentioned, as well as the problems with a raw download measure. It would also be possible to divide the value by the number of hours the package has been available, which would allow you to compare the popularity of new packages vs. veteran ones, as well as the different versions.

    The only question is, how do you report on that? Especially in cases where the package runs on machines with no connectivity out to the world.


  3. Sounds to me like a good candidate for Google’s PageRank: we’re looking for important packages. The importance of a package can be determined as a function of two parameters:

    The first parameter is calculated non-recursively: some measurement of how intrinsically important the package is.

    The second parameter is recursive: if many important packages depend on this package, it is indeed an important one.

    This also correlates well with a package’s trustworthiness: if many important packages depend on it, it should be pretty safe to use.

    Of course I’ve not speculated nearly enough, and many questions remain: should the recursion reference the total importance, or just the recursive parameter? How should we make it stable? And so on.
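
The idea in the comment above maps naturally onto a PageRank-style iteration over the package dependency graph. Here’s a minimal sketch; the damping factor, iteration count and example graph are arbitrary choices of mine, and the uniform base term stands in for the non-recursive “importance” parameter.

```csharp
// A PageRank-style sketch over a package dependency graph: importance flows
// from a package to the packages it depends on.
using System;
using System.Collections.Generic;
using System.Linq;

class PackageRank
{
    // dependsOn maps each package to the packages it directly depends on.
    static Dictionary<string, double> Rank(
        Dictionary<string, string[]> dependsOn,
        double damping = 0.85,
        int iterations = 50)
    {
        string[] packages = dependsOn.Keys.ToArray();
        var rank = packages.ToDictionary(p => p, _ => 1.0 / packages.Length);

        for (int i = 0; i < iterations; i++)
        {
            // Uniform base term: a crude stand-in for intrinsic importance.
            var next = packages.ToDictionary(p => p, _ => (1 - damping) / packages.Length);
            foreach (var (package, deps) in dependsOn)
            {
                // Each package passes a share of its importance to each of
                // its dependencies; leaf packages simply pass nothing on.
                foreach (string dep in deps)
                {
                    next[dep] += damping * rank[package] / deps.Length;
                }
            }
            rank = next;
        }
        return rank;
    }

    static void Main()
    {
        // A tiny invented graph, purely for illustration.
        var graph = new Dictionary<string, string[]>
        {
            ["SomeApp"] = new[] { "NodaTime", "Newtonsoft.Json" },
            ["NodaTime.Serialization.JsonNet"] = new[] { "NodaTime", "Newtonsoft.Json" },
            ["NodaTime"] = Array.Empty<string>(),
            ["Newtonsoft.Json"] = Array.Empty<string>(),
        };

        foreach (var (package, score) in Rank(graph).OrderByDescending(kv => kv.Value))
        {
            Console.WriteLine($"{package}: {score:F3}");
        }
    }
}
```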


  4. This makes me think of the site https://stackshare.io/. It’s certainly not a complete breakdown, but I’ve used it to help me find different tools for the same job and to see how popular they are. The problem is that the results are far from complete, as lots of the data is only user-reported.


  5. I sorely miss a changelog on NuGet package pages, because that could better indicate whether a package was downloaded many times because of a good feature or because of an important security fix.


  6. Hi. I don’t care about download numbers for open source packages. I go to GitHub and look at the number of stars, contributors, issues, recent commits and the documentation.


  7. I was pestering the nuget team in October-November of 2017 because stats weren’t updating. The Program Manager asked me what I do with stats, and this was my reply. From https://github.com/NuGet/NuGetGallery/issues/5089

    You asked on Twitter what I do with stats, and how often I check them:

    – I check multiple times per day until I’ve seen them updated for that day, then stop until the following morning (I shout at you guys on Twitter when they don’t update, as you’ve noticed. :) )
    – I record daily stats in an Excel spreadsheet, because I find the NuGet page rather sparse

    What I would like to see is more akin to how you would track a sports team or distributed computing project:
    – Moving 7-day average
    – Moving 30-day average
    – Predicted download totals for the next week and month based on current growth rates, and maybe a farther-out metric like 90 days from now

    As the developer of a niche but important package that’s steadily growing in popularity, I also care about understanding two categories of use:

    – User-initiated downloads, e.g. by clicking Install in Code or Studio
    – Downloads initiated by a CI build or similar process

    I know teasing these apart is imperfect, but having a rough picture informs how I think about the development of new features, documentation, and legacy support.

    I also wouldn’t mind a high-level breakdown of “user probably manually downloaded the package” actions vs “probably downloaded as part of a CI build”. I realize this isn’t perfect, but…

    I would add today that I consider CI builds important to know about, because those users presumably have quite a lot of business value at stake, and I tend to weight their issues more heavily than someone who’s just playing around. (Although today’s experimenter can be tomorrow’s heavy user.)
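
As a minimal sketch of the moving-average and projection metrics that comment asks for: the daily figures below are invented, and the prediction is a deliberately naive flat extrapolation.

```csharp
// A sketch with invented numbers: the 7-day moving average and a naive
// projection a package author might track from daily download stats.
using System;
using System.Linq;

class DownloadStats
{
    static void Main()
    {
        // Invented daily download figures, purely for illustration.
        int[] daily = { 120, 135, 128, 150, 160, 155, 170, 180, 175, 190 };

        // One 7-day average per day, once a full week of data exists.
        double[] movingAverages = Enumerable.Range(0, daily.Length - 6)
            .Select(i => daily.Skip(i).Take(7).Average())
            .ToArray();

        Console.WriteLine($"Latest 7-day moving average: {movingAverages[^1]:F1}");

        // Naive projection: assume the latest average simply holds.
        Console.WriteLine($"Predicted downloads over the next 30 days: {movingAverages[^1] * 30:F0}");
    }
}
```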


  8. As the author of a package competing with the uncatchable Newtonsoft.Json, I find it completely unfair that package download count is the most prominent stat. I mean, just by starting a web project, I’ve downloaded the package. The choice isn’t even mine. Maybe I don’t want to use that serializer. But the damage is done: the package’s download count has been incremented.

    Maybe I’m just whining…


  9. I’m a package author also. I would like more stats and more feedback.

    It would be nice to see all the update checks from Visual Studio. That would be a good indication of usage and of which versions people are on.

    A way for users to provide feedback for a specific lib and version would be nice.

