3500 packages uploaded to PyPI, pointing to a malicious URL (twitter.com/datanerdery)
257 points by DyslexicAtheist on March 2, 2021 | 128 comments



One of the things the recent Python cryptography debate highlighted for me is how much we depend on this style of library distribution. In that case, it's normal to just automatically get updates from a project, and then one day a whole bunch of software projects suddenly notice and, luckily, break (lucky, as opposed to being compromised). In this case, it's normal to just type a project name to install software, with very little vetting done by most people. I want to work on someone's web app, and npm suddenly downloads the world. Who's actually audited all that? I know I haven't.

Not sure how we could fix it without slowing way down and doing a lot more work.


> Not sure how we could fix it without slowing way down and doing a lot more work.

Unfortunately, this is pretty much the only way, besides the "walled garden" approach.

I am often derided as being a "grumpy old control freak." They def have a point, but I am very, very leery of bringing code into my projects that I don't understand, and stepping through source takes almost as long as writing from scratch.

Most dependencies cover a wider range of functionality than the narrow application that we seek in our individual projects, so, to truly understand the dependency, we need to examine all of its faces.

I recently had to yank a dependency out of a project that I'm developing, because it had a regex bug that caused memory leaks and random crashes in worker threads. It would have been way too difficult for me to figure out what was happening, so I removed it, and wrote my own functionality that is not as good, but doesn't crash my application. This is the kind of behavior that gets me scorn. I'm supposed to keep the crash, and shrug.

So, yeah. I guess I am a grumpy old control freak...


Can you believe how much work it is to even get repeatable builds out of these systems and people?


Random tangent: I was poking around your blog and found the entry "1985 Dijkstra interview"[1] and I was struck by this line:

> The net effect of it seems to be that a full system for really acceptable programming will be at the same time a full system that will suffice for the description of constructive mathematics.

He said that in '85!? Wow.

[1] https://maniagnosis.crsr.net/2007/10/1985-dijkstra-interview...


> a full system for [knowably-non-buggy] programming will be at the same time a full system that will suffice for the description of constructive mathematics.

That's basically the Curry-Howard correspondence[0], which was first explicitly stated in 1980.

0: https://en.wikipedia.org/wiki/Curry-Howard_correspondence


> stepping through source takes almost as long as writing from scratch.

"But why would you do that?" and "But thats the point of using those libraries!" right?

When you're the only one who can find the heisenbug and fix it, your effort and the need for it aren't evidence of over-reliance on dependencies. They're an argument for re-implementation from scratch in brand-new, clean, safe technologies!

You may have started a howl.


Well, the classic argument for open source is that the "customer" can always pop open the hood and look at the gears and whatzis.

This is seldom actually done. More frequently, customers will find bugs and fix or report them, thus improving the tool for all users.

In this one case, it was not reasonable for me to do this. The application that I'm developing is closed-source (for now), and I was in the middle of a pretty massive refactoring job. There was no way that I was going to take a couple of days to try to nail down a random memory leak/thread safety bug in someone else's code, when I could simply use the Apple OS tool to do something similar, and I knew it would be safe.

I definitely could have figured it out (given time). I'm a very good debugger, but it would have been an unacceptable branch in my workflow. I have a pretty full dance card, and this was not on it. Thread safety issues can be a pain to track down, and they are like cockroaches; for each one you see, there are a dozen more, behind the drywall.

I wasn't about to raise a ruckus about it. I suspect that the tool works fine in standard main thread apps, and my application uses worker threads that originate from things like network callbacks. The author won't shut down the tool, because one random person with a specific workflow has an issue. I have managed open-source projects, and know what it's like to be on the other end of that crap.

The biggest issue with dependencies is trust. That's why walled gardens are attractive. There's a chain of responsibility, and some kind of accountability baked into them. Even then, some of these libraries can get compromised, or sold off to nefarious actors. Also, I suspect that many of these SDKs and dependencies are really "first hit is free." They can make us dependent upon a person, company, or whatever, that does not have the best interests of our user at heart.

I tend to rely on manufacturer API/toolsets, and even then, I'm really careful. Some of the SDKs that ship with devices and services can be...questionable. But if I find myself blaming the compiler for an issue, I can be reasonably sure that the actual problem is mine. If I am using a blackbox tool that sits between me and the iron, it can be difficult to figure out where the problem lies.

Just a couple of days ago, I had to change the hosting provider that runs the backend for the application that I'm developing. I wrote that backend, and there was definitely some kind of bug, but the hosting provider decided to introduce a new "can't be switched off" inline cache that completely broke interactions with the backend, so I couldn't find my bug. If the backend had been a server that I don't control, it could have taken many more days to even figure out that the bug was in the host, report it (or diagnose it), and then wait for the inevitable back-and-forth before it was addressed. Since my primary work is the frontend app, that would have meant a huge delay. I figured out the cache issue, because I am intimately familiar with the backend, and could easily diagnose problems, using the classic "divide and conquer" methodology. Once the hosting provider was switched, it took me half an hour to find and fix the bug.


> I could simply use the Apple OS tool to do something similar, and I knew it would be safe.

Even that assumption is being eroded


Sadly, you have a point.


I don't think that understanding your dependencies is a reasonable criterion (the intellectual scope of your dependencies should be beyond your intellect for all but very simple projects). Furthermore I think that your approach essentially rejects the good engineering/design principles of modularity/separation of concerns (it is _good_ to black box things).


Depends on how much understanding you're talking about. I don't need to know how to build an x86 chip, but as an experienced and educated professional programmer I should probably have a decent sense of which kinds of instructions are time consuming and which aren't, how the cache works, etc.

Abstractions are leaky. You should strive to understand where your black box is transparent.


> I don't think that understanding your dependencies is a reasonable criterion

I agree, I've never gone as far as reviewing/understanding source code for all the libraries I use. Might be nice to have the time ;-)

> it is _good_ to black box things

I trust that you are talking about using other people's black boxes, as opposed to reinventing your own?

I would say that there is some element of risk in trusting a library that someone else made, particularly over time as systems need to be upgraded. Will there be problems during or after an upgrade? Is the whole dependency chain upgradable? Will the upgrades work on all operating systems in use? Are there relevant corporate policies and/or whitelists that might interfere with or slow down upgrades?

Personally I pay close attention to the dependencies I'm using in a project, especially long-lived projects. I find it super aggravating when something I've tested and deployed breaks a couple years later after routine upgrades due to some kind of drift problem to do with a dependency.

How much benefit am I getting by using the library? Best is using the base library that comes with the language. Next best would be a well-maintained third-party library with a good reputation in the community, one that doesn't have too many dependencies itself. Am I happy with the API of the library, and its documentation? Is it reasonably easy to approach given its functionality? Is the library available as a package in the operating systems I'm working with? How long has it existed? How stable has it been over time? Is the library a good fit for what I am doing, or do I have to bend my code to use it as intended? Do I have multiple libraries to choose from? Can I use the library in such a way that I can change my mind later?

I think there is value in trying to minimize the use of external libraries, up to a point.


"Might be nice to have the time ;-)"

It's not about having the time, but -making- the time. For the most part we simply do not prioritize looking into the source code of the libraries we use, to any meaningful degree. (I'm guilty of this for sure). Hence things like leftpad and the many other instances of dependency controversies.

In an ideal world we [and our employers] would put emphasis on inspecting the huge amounts of code we pull in and allocate time for it.


How would you ever have time to write new code?


If you don't write new code, you also don't pull in new dependencies, and so have time to start new code ;) And putting a cost on dependencies makes you optimize what you pull in.

(The cost of dependencies doesn't have to be just review time. E.g. in my embedded software projects, the calculus for "will we need to update this" looks very different than it would for your typical web app. And extracting interesting bits from libraries and including only those in the main project is maybe more common, reducing the surface of a dependency - which would be a bad idea if you expect frequent changes, but these projects typically don't.)


I have plenty of time to write new code. I write code (often lots of code) every single day.

I don't mind dependencies, but I won't use them for anything mission-critical, unless I spend a great deal of time vetting the dependency. That might mean auditing the code, but more often, it is auditing the coder.

The main thing is that we need to really think about anything we add to our precious project from outside. In my experience, people seem to be alarmingly casual about this.

Good results can be building very big systems, in a very short period of time, using very few resources, and everything works great.

Bad results can be building very big systems, in a very short period of time, using very few resources, and everything works great.

Until it doesn't.

For example, an upgrade introduces things like renamed CSS hooks, or introduces thread safety issues. A PR isn't vetted properly, and license-problematic code, or malicious code is introduced into the system. Maybe the author has a car accident, and can't manage the library anymore. Maybe a dependency down the chain gets taken over by someone that uses...let's say "solarwinds123" as the password for their CI server, and they get pwn3d, but, since they are buried three levels down, no one in the main dependency realizes what happened, and so on. Maybe you get to a point where you outgrow your backend, and need to change to a more robust or secure backend, but the SDK you are using has you addicted to the initial backend that you selected, when the project was still in the crib, etc.

It should not be a casual decision.

In my original post, I talk about having to yank a dependency.

It was a phone-number parser. If you know anything about phone numbers, parsing them is non-trivial. Apple has a built-in utility (a scanner for URLs, phone numbers, addresses, etc.), but that is not as effective as the parser I imported, which had a fairly good pedigree. I looked at the code, and checked out the author, and it looked good.

The issue that I encountered was most likely (can't be sure) a thread-safety issue. I ran parses in non-main threads, and noticed memory leaks, and random crashes.

I do not suffer memory leaks or crashes of any kind, in my apps, so it had to go.

Phone number parsing wasn't a dealbreaker. Because it was a dependency, its application was fairly encapsulated (I always encapsulate external dependencies -just plain old horse sense), so I can go back and maybe reinstate it (if I can fix the issue, or use it only in main threads), or add a different one in the future. In the meantime, the Apple tool is fine for my development.
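Roughly, the shape of that encapsulation, as a hypothetical Python sketch (my app isn't Python, and all the names here are made up):

    # Hypothetical sketch of encapsulating an external dependency:
    # callers only ever see PhoneParser, so the third-party backend
    # can be swapped for the platform tool (or removed) without
    # touching the rest of the app.
    class PhoneParser:
        def __init__(self, backend=None):
            self._backend = backend        # e.g. the imported library

        def parse(self, raw):
            if self._backend is not None:
                return self._backend.parse(raw)
            return self._fallback(raw)     # "not as good, but doesn't crash"

        @staticmethod
        def _fallback(raw):
            # naive stand-in: just keep the digits
            return "".join(ch for ch in raw if ch.isdigit())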

The backend for the project, on the other hand, is mission critical, so I wrote it myself. I am not a backend expert, but there are some things that I needed assurance on, that no backend engine can give me. It's highly encapsulated and layered, so future efforts can replace it. I can tell you that whatever it gets replaced with, will need to meet my exacting criteria.

My current project has one external dependency (it has quite a few, that I wrote): A keychain abstraction. It's very simple, and is actively maintained. I've used it in other projects for years.

The dependencies that I wrote are pretty intense, even the small ones. The testing code dwarfs the implementation code in all of them. One of them is an SDK for a pretty vast system, where I was the original architect, but which has since passed on to a new team that has my complete confidence; so I guess you could also call that an "external" dependency, but one that I know very, very well.


A good black box relies on the assumption that the inputs / outputs are well documented and true to their spec.

This is a very unsafe assumption, at least until we invent a programming language that disallows bugs of any class.


What's the point of a philosophy that is impossible to live up to? Nobody audits a substantial fraction of their dependencies.


> Nobody audits a substantial fraction of their dependencies.

I do! In fact if we exclude the language implementation itself[0], a plurality of my projects have NaN% of their dependencies fully audited.

0: ie, compiler, hardware, etc; if you object to this, please do explain how to eliminate these dependencies; I'm quite interested.


> What's the point of a philosophy that is impossible to live up to? Nobody audits a substantial fraction of their dependencies.

It's a choice to be ignorant of your dependencies, not a law of nature.

The fact that most people don't audit code they depend upon is a problem to be fixed, unless the underlying code is fixed first.


> it is _good_ to black box things

... to teams you can 100% trust and who are in the same sphere of responsibility. So yes, take a binary library blindly from the team down the hall who reports to the same VP.

From an unknown rando on the internet, not so much.


The only practical way is to have a centralized vetting organization that can give a trusted stamp of approval to signed packages. That would still slow things down, but at least wouldn't require duplication of work by everyone on a massive scale. You could outsource your vetting work to a distribution you trust.
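As a sketch of what the consumer side could look like, under the assumption that the vetting organization publishes a list of approved artifact digests (the file name and format here are made up for illustration):

    # Minimal sketch, assuming a vetting org publishes a JSON list of
    # sha256 digests for approved artifacts. A real system would also
    # verify a signature on the allowlist itself.
    import hashlib
    import json

    def is_vetted(artifact_path, allowlist_path="vetted-sha256.json"):
        with open(allowlist_path) as f:
            vetted = set(json.load(f))
        with open(artifact_path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        return digest in vetted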


I was trying to find a solution to this problem for my organization and came across https://tidelift.com/. I like some of their ideas and think it's a solid path forward at chipping away on this issue.

https://libraries.io/ is a project of theirs I use quite often when vetting third party dependencies for our organization.


The only practical way is to run all software in sandboxed containers that statically link everything they need and cannot harm the system via bugs/exploits from outdated/compromised libs.

In the past, there wasn't much software or much memory, and software was simple. Now, memory is plentiful, there's too much software, and software is too complex. A central authority doesn't scale. Fixing our naivete about how much to trust software does.


Software that I can't put data into or take data out of isn't very useful. You can mitigate the splash damage something could do and reduce the attack surface, but at some point I'm probably going to be feeding that process some files or data stream, and I'm going to have it write out some files or data stream. Those data entry and exit points are still exploitable, or acting on the data stream could be harmful. Sandboxing is good to do, but ultimately isn't the solution to the problem of malicious code. Otherwise it's pretty much just a space heater.


I can kind of see what you're getting at but a pure function can both accept and return data without any risk of this data leaking anywhere unintended.


Data leaking isn't the only thing you may be worried about. What about a random number generator designed as a pure function that skews a certain way, reducing your cryptographic integrity? What if what looks like a pure function from the outside, which you're using for some range estimate on some ballistic payload, is able to detect it's in a live-fire scenario and changes its outputs, leading to inaccurate detonation? We're talking about malicious external packages/libraries and whatnot coming in; assuming everything is a pure function in the compiled black box of obfuscated code probably isn't a winning strategy for keeping your system secure. Of course, everyone's threat profile and the risks of failure vary greatly based on the application being deployed.


You're definitely right that decreasing the attack surface will be an important element to improving the situation, but it doesn't really solve the problem completely. Software will always need operational room to do anything useful and in so doing will still offer an opportunity for exploitation.

Increasing the trust we are able to place in the tools and libraries we leverage isn't a complete solution either, but it will have to be an element. Currently we've got nothing but blind faith and crossed fingers.


The issue that approach runs into is a mirror of the one raised by the pyca/cryptography business. For which CPU and hardware architecture should packages be verified? On which OS? With which set of dependencies, including dependencies on non-Python compiled libraries?

Presumably all of those will need to be verified as well, and the problem recurses.


Every different CPU architecture and OS architecture is a different package. Software distributions already know this. The individual packages need to be verified.


At least with Python, this is true only if you're installing pre-compiled wheels, which is quickly becoming the norm. For a while, though, several packages were compiled on install, so you wouldn't necessarily have a different package per architecture/OS combination.


You don't have a different package name, but you for sure have a different package. Take, for instance, the keyword "long": 64-bit on amd64, 32-bit on armhf. So even source distributions to be compiled locally differ based on target, even if they have the same coordinates in PyPI.


I agree the resulting binaries will be different but I disagree that these are then different packages. The package is whatever the distributor distributed it as. They distributed it as source, which you took and compiled to your local binary version. With "The individual packages need to be verified", and each resulting different binary being uniquely a different package, this means the package/repo maintainer of a source package would need to verify it for every single platform which compiles C and every potential compiler and every potential set of flags for the compiler as each set of platform/compiler/flags could result in a different binary and thus different behavior.


Yeah, if I were setting up a verification service there's no way I would verify source code instead of compiled binaries.


That sounds like it would have the app store problem. If you are successful, you are a monopoly target. If you miss an attack, you are to blame. If you don't approve fast enough, you are blocking.


The entire point of such an organization would be to accept the blame if an attack was missed. But there should be no natural monopoly since anyone could offer this vetting service. And as an end user there would be nothing preventing you from circumventing the repository of such a service (except for the risk you'd be taking on for yourself).


Scapegoat as a Service?


Sure, you could look at it like that. But their only value would be if they're never (or at least very rarely) actually called upon to act as a scapegoat. All the incentives would be for them to do their job and keep compromised software out of the ecosystem. Of course nobody is perfect, but they couldn't be flippant about it and survive.


What are the incentives? I honestly can’t tell.


The only reason for such an organization to exist is to be a clearing house for software; adding a stamp of approval that it is safe and hasn't been compromised by nefarious actors. Such an organization might be funded by a trade association, industry sponsorship, or direct fees to end users. But whatever the case, it would be strongly incentivized to fulfill its mandate or else lose credibility, reputation, influence, and financial rewards.


> Who's actually audited all that?

That's the point of software distribution maintainers. They are actual humans who must essentially sponsor a project before it becomes a package in a repository. They take responsibility for the packages they maintain. Users trust these humans since they usually don't make mistakes like letting literal malware into the software repositories.

Most modern languages are deliberately designed to be incompatible with this model. They optimize for developer ease of use and developers don't like getting approval from other people in order to publish their code. So they make their own little isolated worlds where everyone can upload anything they want.

Honestly it's amazing it took this long for people's trust to be exploited.


Languages aren't incompatible at all. People choose to use excessively liberal package-management systems.


> Not sure how we could fix it without slowing way down and doing a lot more work.

You restrict what your dependencies can do in the first place, so that if they're malicious (or just buggy) the scope of what they can do is limited. This doesn't eliminate the risk entirely - after all, it's possible for a library to introduce vulnerabilities just by doing its job incorrectly - but it massively limits the scope of what you'd need to audit. Right now, any dependency can do anything, so you'd need to audit all of them.

See POLA Would Have Prevented the Event-Stream Incident[0] for more explanation and LavaMoat[1] for an example of tooling that's trying to tackle this problem.

Now, this might count as "doing a lot more work" since it's admittedly not quite as simple as just typing in a project name. It's much less work than rewriting everything from scratch though. :)

[0] https://medium.com/agoric/pola-would-have-prevented-the-even...

[1] https://github.com/LavaMoat/lavamoat


> Not sure how we could fix it without slowing way down and doing a lot more work.

We could create a market for library audits that the library's team would pay for, and provide "vetted" tags for libraries that have audits on minor updates.

Things can continue as they are for smaller libraries, but any library that's used in more critical areas should have no issue finding audit sponsors.

This would also create an incentive for somebody to develop automated auditing tools.


That's what Anaconda is.


Many people here say that slowing down is a must -- and I agree it's probably the best solution -- but surely there are more approaches we could think of:

* Not allowing packages with similar names to popular ones (a rough sketch of this check is below)

* Not allowing packages creation to be anonymous (in the extreme case you would require to validate your passport or similar)

* Automatic detection of malicious code

* Central auditing organization ...

This is just off the top of my head; there must be many more ideas.
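For the name-similarity idea, a toy sketch of the registry-side check (the popularity list and threshold are placeholders, not real registry data):

    # Toy sketch: flag new package names that are suspiciously close
    # to popular existing ones. POPULAR and the 0.85 threshold are
    # placeholders for illustration.
    import difflib

    POPULAR = ["requests", "numpy", "pandas", "cryptography", "urllib3"]

    def squatting_target(name, threshold=0.85):
        for pkg in POPULAR:
            if name == pkg:
                return None  # exact match is the real package
            if difflib.SequenceMatcher(None, name, pkg).ratio() >= threshold:
                return pkg   # likely typosquat of this package
        return None

    print(squatting_target("reqeusts"))  # -> requests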


Well, you can't; slowing down is a must. Blindly taking packages of executable code to be embedded in your code, from anyone and everyone in the world who can publish a dependency, is very flawed. It only mostly works because most people are nice most of the time, but that's not a process one can trust.

I understand the desire for these free-for-all package distributions where every tool has its own library downloader and anyone can publish anything... but the model is inherently incompatible with security.

On the provider side, there needs to be a strict process of validation and code reviews to get any code changes into a public library. On the consumer side, there needs to be the will to take personal responsibility to curate everything that is imported and minimize dependencies.

There's no shortcut to secure code.


This is why I absolutely hate not pinning down versions/hashes.

For example, with `cargo`, let's say a library I require passes a security audit. However, because a library it requires, or one of its dependencies requires, doesn't pin down a specific version but instead just requires "latest" or "2.1.*", my security audit is for naught, given that malware can slip in at any time for 2.1.*.

That goes for testing too. I've tested my software, etc., but sometime down the line a transitive dependency updates and adds a bug. Now my software, which I haven't changed, is broken.

Software requirements without pinning is a code smell.
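In Python land, pip can enforce the same discipline with hash-pinning (a sketch; the package name and digest below are placeholders):

    # requirements.txt (placeholder values): with --require-hashes,
    # pip rejects any artifact whose sha256 doesn't match, no matter
    # which index served it.
    #
    #   pip install --require-hashes -r requirements.txt
    somepackage==1.4.2 \
        --hash=sha256:<digest-of-the-audited-wheel>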


With Cargo, the final binary (or artifact if you're building a .so/.a/.dylib/...) has a `Cargo.lock` file that pins every direct and transitive dependency. If you audit the packages in that, you've audited every dependency that is used in your project, and they will not change unless you yourself explicitly tell them to.


Oh! TIL, thanks!


If your direct dependencies pin their dependencies, then you will definitely end up with conflicting versions for your transitive dependencies.

This might work in environments like Node, where each library having their own private versions of their dependencies is acceptable, and Rust/Cargo in some situations, but doesn't work in environments where only one version of a package can be present (e.g. Python) or you care about the total size of the dependencies.

Also if you're auditing, wouldn't you audit all dependencies? What does it matter whether they are pinned by your direct dependencies, wouldn't you yourself pin down the transitive dependencies after auditing them?


> Also if you're auditing, wouldn't you audit all dependencies?

What's the point, if a transitive dependency can pull the rug out from under your audit with a single update?


Don't update without auditing the new versions? I don't understand.


> Who's actually audited all that?

I'm sure it's insufficient, but you asked who audits all that (re: npm). Well, apparently somebody.

https://docs.npmjs.com/cli/v7/commands/npm-audit

> The audit command submits a description of the dependencies configured in your project to your default registry and asks for a report of known vulnerabilities. If any vulnerabilities are found, then the impact and appropriate remediation will be calculated. If the fix argument is provided, then remediations will be applied to the package tree.


This provides a list of well-known vulnerabilities, it doesn't actually audit the source code to find unknown vulnerabilities.


The business case for accepting the risk of downloading untrusted software has proven time and time again to be worth it. Without NPM modules, time to deliver would be much, much longer. If it was that bad of a problem, we would've collectively moved away. Many companies privately invest in their own registries that proxy to NPM, with an intermediary security layer. The maintainers of NPM have shown they'll clean up the left-pads and major security issues when they occur.


> Without NPM modules, time to deliver would be much, much longer.

This is a circular argument: "We chose a language that has millions of tiny dependencies; now we need a tool to manage millions of tiny dependencies."


If you must ship JavaScript for the web, you don't have much of a choice in languages (you might choose a superset like TypeScript, which is still distributed via the NPM ecosystem). For server-side code, agreed. You don't have to choose Node.js or JavaScript or NPM there.


Would anyone pay for a curated service?


https://github.com/pypa/pypi-support/issues/923

"The package only contains __init__.py file, that says:

# the purpose is to make everyone pay attention to software supply chain attacks, because the risks are too great."


Did you miss the statement (and relevant code snippet) immediately preceding that?

> The project contains a setup.py file that sends a request to a malicious URL during installation.

I can't comment on whether the URL is actually malicious or, perhaps, just logging requests for statistics / tracking.


"Sends a request to a malicious URL". Malicious is relative of course.

The code reads:

    import requests
    # phones home with the name of the package being installed
    url = "http://101.32.99.28/name?<package_name>"
    requests.get(url, timeout=30)
So it looks like the author wants to track the number of installs. Nothing is done with the response value (at least for the setup.py that I saw).

Edit: seems to be a Tokyo based IP.


The IP is from "Tencent Building, Kejizhongyi Avenue". That's weird, if it was just because they were an ISP, why would it say "Tencent Building"?

https://ipapi.co/101.32.99.28/


Ultimately it's up to the owner of the ASN what to use as the organization name. In this case, the ASN chose the name "Tencent Building, Kejizhongyi Avenue"; you can see it over at https://bgp.he.net/AS132203 (the ASN owns this specific IP as part of the ranges they control).


Looks like the first line of the address for the headquarters of Tencent Cloud in Shenzhen.


That surprised me too (the IP lookup I used said "domain: tencent.com") but I don't understand enough about IP to tell whether that's unusual or not.

Unrelated tidbit: tencent.com returns an empty reply for me at the moment.


> tencent.com returns an empty reply for me at the moment

That's because the correct address is https://www.tencent.com/ (prefixed with www.)


Wow. Can the massive Chinese conglomerate really not forward that to www?


They are hardly the only site to do that. It is only by convention that non-www-prefixed websites tend to redirect to www-prefixed ones. It's not at all mandatory, and technically they are two different hostnames. It may be intentional, to ensure only exactly the right hostnames are ever used, e.g., www, ftp, smtp, etc.


Rather unusual, but it’s a corporate rather than customer-facing website so not really all that important. It’s not like apple.com, google.com, or microsoft.com where the brand name domain is the product or is where the products are sold. I just checked and apex domains for all their flagship products redirect to www just fine.


Seems to be working as intended; it's a non-www URL.


I rarely see a site that doesn't redirect it to the appropriate subdomain. Then again, I don't often check unless it's broken.


Oh, nice catch! Strange setup.


Assuming requester IPs are being logged, this is a good way to build a map of potentially vulnerable organisations. No need to do anything now.


Hi, thanks for sharing. I noticed the above, and I thought it was captured quite well by the title. Thank you for flagging it, though; it is useful context. My overall assessment is that this is grey/whitehat, but it merits further investigation.


It returns a null binary. Of course, this could be a targeted attack though.


The setup.py that I saw doesn't do anything with the response of the HTTP request so if it were indeed a targeted attack it would need to exploit some kind of weakness in the network stack. I guess that makes a targeted attack quite unlikely.


I went to the IP and it sent a single file (1B) containing a line break. I don't know if they changed it recently, but that's what happens on that URL right now.


Using which HTTP client?

To test "correctly", you'd likely want to use the requests library or, at the least, ussend the same User-Agent header that it does.

(And, for all we know, they could just be targeting certain IP addresses... or only responding the "malicious" response the first time the URL is requested per IP... or some other weird conditions that we aren't aware of.)


The packages do nothing with the response, from the code I've seen.


Ah, I see. I was just using Firefox to test things out; I'm not really experienced with these things but I do know how to use curl to an extent. :D


If we are being honest, nobody here can really answer that. The responses can be benign until the attacker chooses to change the payload based on source IP, time of day, source network, user-agent, etc. I believe it is best to block any outbound network connection you did not define in your code, e.g., your code uses a filtering proxy for approved API destinations.
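One blunt way to approximate that inside a Python test or CI process (an illustrative sketch, not a hardened sandbox; the allowlist is made up, and it assumes (host, port) tuple addresses):

    # Sketch: refuse outbound sockets to anything not on an allowlist.
    # Useful as a tripwire in tests/CI; code running in-process could
    # undo it, so it is not a real sandbox.
    import socket

    ALLOWED_HOSTS = {"api.internal.example"}   # placeholder allowlist

    _real_connect = socket.socket.connect

    def _guarded_connect(self, address):
        host = address[0]
        if host not in ALLOWED_HOSTS:
            raise ConnectionRefusedError(f"blocked outbound connection to {host!r}")
        return _real_connect(self, address)

    socket.socket.connect = _guarded_connect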


Here's cupy's postmortem: https://github.com/cupy/cupy/issues/4787

Salient points being that cupy releases a new named package for each CUDA version, so future package names are of course predictable. Since PyPi doesn't allow namespacing, cupy's plan is to register new names ASAP when CUDA releases a new version, and to monitor and report other packages purporting to be cupy that get uploaded.


That's interesting. There is a policy and process (PEP 541) in place for addressing this, and it seems to have been executed swiftly and responsibly. Is this incident an argument for namespacing, or that the status quo is good enough?

I feel like domain name disputes tilt disproportionately to trademark holders, so I wouldn't like to see, e.g., cupy block a cupyd package or vice versa (or, for that matter, NVIDIA somehow strongarm cupy). On the other hand, you'd like a mechanism by which you can trust that a package comes from the cupy maintainers.


Why not pre-register? Why wouldn't an attacker pre-register?


> PyPi doesn’t allow namespacing

Yet. I suppose it might wind up vaporware, but the feature request is under discussion/development.

BTW, it's PyPI.


This looks like someone trying to replicate Alex Birsan's attack on Apple, Microsoft, and others but through package confusion instead of version squatting. Reasonable speculation:

1. Host thousands of typo packages that phone home every time they are installed. Store IPs.

2. Find which companies own those IPs, filter out residential

3. Report vulnerabilities

[0] https://news.ycombinator.com/item?id=26087064



Alex Birsan's attack was much more involved and raised issues less recognised by a vast number of community members. This one is basically typosquatting, a well-known issue that Birsan's manoeuvre did not focus on.


How hard is it to use a VPN?


The user account in question is named “RemindSupplyChainRisks”, so it’s pretty clearly at most a grey-hat scenario.


You may be surprised to learn that even bad actors can choose innocuous names! They often don't if coming up with and maintaining names tends to be more expensive than it's worth when you're just going to burn them as they get caught, but it's totally viable for e.g. a targeted scenario.


I would think that an actual bad actor would choose an inconspicuous name, not a clear “IAmAGreyHat” name. But, of course, that’s what they’d want me to think, so who knows.


Continue and soon you'll end up playing an intelligence challenge in front of an iocaine-poisoned glass of wine :-)


Joke's on them, I've spent the last few years building up an immunity to iocaine powder.


What would be an inconspicuous name in this scenario? Unannounced typo-squatting is at best a greyhat activity - as the urban legend goes, better to be caught shagging the sheep than rustling it.


I would simply name myself IocainePowder


Assuming the 'malicious' URL is really just statistics collection, I am really interested about the post we will hopefully get next week, with a few nice plots.


I think this noise isn't really a problem unless people misunderstand PyPI.

PyPI is not a whitelist; it's just an index.

People shouldn’t randomly download packages without understanding and verifying EVERY one.

I think this is lazy devs and users who mixed up the App Store with PyPI. These "exploits" aren't very useful, other than helping people understand these problems.

It seems like an easy solution for some company to set up: test and verify packages, and create a whitelist service that I could subscribe to.

I currently use dependency checkers like safety and snyk that will alert me to real vulnerabilities like when someone takes over a package.

That I care about. Were someone to take over pandas and put in malicious code, that would be damaging.

That’s also why I pin all packages to specific versions and don’t update unless necessary.

I think this can be solved by better training.

I worry that these articles are some new form of FUD against Python. So far, I've had some non-tech security folks ping me about "Python vulnerabilities" because they read these articles. But it's pretty easy to evaluate whether my org is actually at risk by reviewing the dependency graph and seeing that no one is using these 3500 bogus packages.


Well, you probably know your top-level requirements, but what about their dependencies, and their dependencies in turn? I guess it could happen that a popular package gets corrupted. My current notebook venv has 87 packages installed; I can't vouch for all of them...


For production uses, I check them all. It’s a pain, but not that hard to do.

I also use dependency checkers that monitor all those various packages for CVEs related to particular versions. I think I typically catch them during dev, and GitHub does a pretty good job of detecting and even suggesting updates.

There is a risk that a package gets corrupted at a used version, but I think that's fairly small and would be detected quickly. But I think that's similar to what would happen if Microsoft or Apple put out a bad update. And PyPI is policed better than commercial setups that don't have checks and balances (e.g., SolarWinds).


> People shouldn’t randomly download packages without understanding and verifying EVERY one

Yes, but people do, and all the training in the world isn't going to help; the package manager needs safe defaults that prevent this sort of thing.


What can the package manager do to prevent someone from typing in the wrong package name?

There are some package managers, like R's CRAN, that are really thorough about what gets in, so maybe if PyPI got more volunteers they could start a "tested" tier where updates require test suites and whatnot.


Typosquatting in package repositories (2016) https://lwn.net/Articles/694830/

Further analysis of PyPI typosquatting (2020) https://lwn.net/Articles/834078/


This just illustrates issues with the 'one-size-fits-all' approach to code distribution. Compare that with the traditional C/Unix way: everyone can publish code on their own website, but there is no implicit trust associated with such code. Then the code may or may not be included in separate 'distribution streams' (like Linux distributions), each with its own independent vetting process.

C/Unix way avoids centralization of power/trust and also avoids implicit assumption that every code should be in one canonical repository.


What makes this exceptionally problematic is pip's behavior with --extra-index-url, commonly used for private indices. To choose between the standard index and the extra index, pip looks at both and takes the one with the higher version. So if an attacker uploads a public package with the same name and a high version, pip will prefer that one. This has actually happened to me, although just by accident and not malice, and I pin and sign my dependencies, so it was caught before it was deployed.
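The resolution rule is easy to demonstrate (a toy illustration, not pip's actual code; the index names and versions are made up):

    # Why --extra-index-url is risky: candidates from all indexes are
    # pooled and the highest version wins, regardless of origin.
    from packaging.version import Version

    candidates = [
        ("private-index", Version("1.2.3")),   # your internal package
        ("pypi.org",      Version("99.0")),    # attacker's public upload
    ]
    source, version = max(candidates, key=lambda c: c[1])
    print(source, version)   # -> pypi.org 99.0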


Seems like a great attack vector for disgruntled ex-employees actually. All they'd need to do is copy the source code and names of internal packages, then release public versions with the same names and same source code, but backdoored. Developers likely wouldn't notice because things would function similarly at least for a while.


How about:

Packages are namespaced and "tier 2" by default.

Official packages, packages from trusted vendors, might-as-well-be-in-the-stdlib packages, and otherwise vetted packages are upgraded to "tier 1" and might drop the namespace.

Project manifests can specify: stdlib only, "tier 1" packages only. The latter sounds like a sane default.


Who decides what's "Tier 1"? How will you fund their work? Currently there are a few vendors providing what you're asking for, including Anaconda (https://www.anaconda.com/) and ActiveState (https://www.activestate.com/).

> Get the most from your use of Perl, Python or Tcl and reduce your compliance, legal, and security risks. We offer custom managed and self-serve distributions, including support and maintenance, on Windows, Linux, AIX and more – even for 32-bit and older releases.


Who decides who is a "trusted vendor", and how? You either give up on a lot of packages or end up in the current situation where everyone is trusted.


That is why I only get my python packages from Debian repositories ;)


1) surely there's a better way to draw attention to this kind of risk...right? 2) I admit they have my attention, but now what?


Add namespaces? Block adding new packages that are not named "vendor-packagename" and ensure that only one entity can own a vendor name. (Rough sketch below.)
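A sketch of both rules at upload time (the naming pattern and registry data are assumptions for illustration):

    # Sketch: enforce "vendor-packagename" naming and one owner per
    # vendor prefix. NAME_RE and VENDOR_OWNERS are placeholders.
    import re

    NAME_RE = re.compile(r"^(?P<vendor>[a-z0-9]+)-(?P<pkg>[a-z0-9][a-z0-9._-]*)$")
    VENDOR_OWNERS = {"acme": "acme-corp"}   # placeholder registry

    def may_upload(package_name, uploader):
        m = NAME_RE.fullmatch(package_name)
        if m is None:
            return False                     # not namespaced: reject
        owner = VENDOR_OWNERS.setdefault(m["vendor"], uploader)
        return owner == uploader             # first claim wins, then locked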


It causes development teams to update roles on who is allowed to do what, and to adopt package signing.


Didn't this happen a few years back as well (circa 2012)?


It's a bit crazy how easy it is to upload executables.


Any way to know how many times the packages were accessed?


Remember the 1990s when everyone mocked MS Outlook for downloading and running unverified software automatically?

Now Linux programmers do that intentionally via package managers.


It's not unverified if it comes from an actually curated package repository, like the one on any given Linux distro. "Package maintainer" isn't just obsolete busywork, it's an essential part of a robust security model IMO.


Linux programmers realize that's what they're doing.

Outlook users probably didn't, most of the time.


"Someone"? Why don't public package managers like PyPI & npm require package owners to publish their real names and verify their identify, similar to WHOIS?


A lot of the maintenance of these packages is performed by volunteers. Putting up barriers, however well-intentioned, will often result not in a more secure maintenance process but in no maintenance at all. I'm generally happy to donate a patch that fixes an issue in a package I use, but that's not going to last long in the face of demands that I jump through hoops to prove my identity (my interest in following the bureaucratic procedures of an organization is closely related to the amount of money they are paying me). Particularly since my experience has been that organizations that are anal about that sort of confirmation tend to be more interested in passing any legal liability on to me than in maintaining a secure maintenance process.


You misunderstand me. I don't think every single code contributor needs to divulge their identity. Only the package owner (one person) needs to vouch for the code in the package by putting their name on it. That person may have written none of the code, but they should be accountable for it. This is similar to how websites have domain owners.


I think you'll find the set of people simultaneously competent to maintain a python package and willing to take on potentially significant legal liability for free is too small to maintain a software ecosystem. This isn't necessarily bad: it means that the process of evaluating and securing code has tangible value and provides incentive to hire programmers or their consulting companies, which improves my odds of being able to feed my children in the future.


So you can sue them should their account get taken over, or for other reasons? Or so that people can harass maintainers they do not like more easily?

Looks like unneeded legal risks with no reward given people can already reveal their name if they want to.


The reward is improved security for the whole ecosystem because anonymous trolls will be discouraged from uploading garbage packages.


Someone"? Why don't public package managers like PyPI & npm require package owners to publish their real names and verify their identify, similar to WHOIS?

If you want a curated Python there’s Anaconda and ActiveState.


The recent trend in WHOIS (in Europe, at least) is to not show people's names. People both deserve and need to be able to be anonymous:

https://geekfeminism.wikia.org/wiki/Who_is_harmed_by_a_%22Re...



