All of us test in production all the time (2019) (increment.com)
198 points by jrvarela56 on July 19, 2020 | 156 comments


So it's a fair point that the fidelity of non-prod environments is inherently limited and you still need a bunch of other stuff like canaries, automated canary analysis, automated rollbacks, zone fault-tolerance, feature-flags, chaos engineering, server and client-side instrumentation, but this is generally 'next-level' stuff when most shops aren't even getting the basics right.

The overwhelming majority of non-prod environments and release processes I've seen have basic problems and would catch a lot more issues with additional engineering investment. For example: releases aren't automated, self-contained, or idempotent; unit test coverage is bad; integration tests are non-existent; the data used isn't reflective of reality; there is no performance or load testing; and downstream or upstream systems have poorly defined interfaces and are excessively mocked or just not considered.


Wholeheartedly agree with everything you’ve said.

I have worked at a lot of places where automated testing is almost nonexistent. Instead they use staging/testing environments and manual testing.

All of these projects frequently have downtime or errors in prod because something went uncaught. It takes ages to release a fix because they have to carefully research how a change affects other code instead of running a test suite.

They are convinced that unit tests are unnecessary extra work for their project but they don’t realize they are losing time by having to manually test and research changes. I think a lot of the focus on fancy testing practices has led some programmers to think testing is a load of extra work when things will fail in prod anyway. We need to push back and say automated tests are a way to codify assumptions about the code and help speed up the prototype-build-test-release loop.

Side-note: self documenting code is a myth but I digress.


> They are convinced that unit tests are unnecessary extra work for their project but they don’t realize they are losing time by having to manually test and research changes.

If people are finding tests unnecessary extra work, maybe they're right for their project.

Most jobs are just "ship feature asap, try not to break anything, repeat". Most managers don't care about code quality as much as speed of delivery. Most coders don't have much "skin in the game" and can always get another job when the project fails.

Writing tests is a sign that you know what you're building, you have a methodology for building it, and you have a commitment to keeping it working in the future. Those simply aren't the requirements for most programming work, and it can be a disservice to both the coders and the management to push for testing that won't help achieve the business goals. If a business needs to iterate fast and reliably, write tests. If a business needs to iterate fast and is ok with breaking things, skip the tests and have a rollback plan instead.


Sorry, I have to completely disagree with this. This is a horrible, but often used, excuse.

In my experience these coders are doing exactly what you would do in a unit test but instead they are doing it manually by adding printlns and testing various values on the command line, in a repl, a notebook, or a main. Instead of manually adding these things every time you want to test your work, if you just added some basic unit tests it would speed up your delivery. Not only that but you can't "forget to test that" because you're forced to write down the test cases instead of deleting them when you're done.

If you add up all the time you spent adding printlns to your code, testing various input values in a main method, researching inside your codebase to see what could fail, debugging your ETL because you forgot about an uncaught error or a different kind of null in this column that only came up after it processed 3TB of data (real example), etc. I guarantee you would save time if you just implement some basic automated tests instead of testing in prod or staging.
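
A minimal sketch of that idea in Python, assuming a made-up `parse_amount` helper and a made-up list of null spellings (nothing here comes from the ETL job mentioned above). The checks that would otherwise be typed into a REPL and thrown away are written down once and rerun on every change:

```python
import unittest

# Hypothetical spellings of "null" that might turn up in one column.
NULL_TOKENS = {"", "null", "none", "n/a", "nan", "\\n"}


def parse_amount(raw):
    """Return the column value as a float, or None for any null spelling."""
    if raw is None:
        return None
    text = raw.strip()
    if text.lower() in NULL_TOKENS:
        return None
    return float(text)  # raises ValueError on anything else


class ParseAmountTest(unittest.TestCase):
    def test_plain_number(self):
        self.assertEqual(parse_amount(" 12.50 "), 12.5)

    def test_null_spellings(self):
        # "\\N" is the two-character string backslash-N.
        for raw in (None, "", "NULL", "n/a", "NaN", "\\N"):
            with self.subTest(raw=raw):
                self.assertIsNone(parse_amount(raw))

    def test_garbage_raises(self):
        with self.assertRaises(ValueError):
            parse_amount("twelve")
```

Saved to a file, this can be run with `python -m unittest`. Each new "kind of null" found in real data then becomes one more entry in the test instead of something to remember.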


I don't disagree, probably.

I keep finding myself working on internal projects that are driven by a UI and have little business logic but lots of I/O to other systems. With no hard requirements, no guarantee of shipping and no direct access to end users. Which I find less suited to unit tests than (automated) integration testing. As soon as you put unit testing in all anyone can ask is what your test coverage percentage is and why it's some value less than 100.

Unit testing is a dev tool, if a dev sees value in writing and maintaining a given test, great, otherwise it could just be that it's not a panacea. Let them strike a balance but ensure the team ask the question of any production issues - could a unit test have prevented this?


Yeah no.

The biggest issue in all my dev work so far was a really shitty codebase that somehow worked but had, and still has, big hidden issues.

Perhaps you just don't see it today, and you delivered in 1 day instead of 3. But you will be fixing the issue in production half a year later, and by then you will have lost data, corrupted data, or data you have to correct.

Not doing any tests, is literally shit.

Missing Unit tests also often means no proper CI/CD pipeline. Because yeah why would you need a proper build if you don't even execute tests anyway right?

IT IS always a clearly ignored requirement of doing proper work. Quality work. Do less, do better. It's the fault of the manager for not thinking about it and enforcing it, and it's the fault of the developer for building shit.


Managers need developers to write tests so that when that developer job hops the new dev on the team can actually make changes with confidence. As it happens, writing tests and adhering to best practices has a side effect of keeping turnover low because developers can do their jobs with less friction.


Then, may I present the argument that, in general, you should be testing, if only for one reason: to get practice writing tests for when you actually have a project that you need to write tests in (edit: according to your metric).


Self-documenting code is all code. The only thing that speaks to what a piece of code does is the code. Comments aren’t syntax, so they don’t help you there. I’ve never seen a codebase that does a good job of documenting the system architecture, expected usage patterns, and why the code is built the way it is. And I don’t know that having all that be inline to the code would be helpful anyway.

I find it far less error prone to just read the code and determine its purpose and allowable modifications from that. I am thinking about what a citation or footnote system might look like for code now though. That has the potential to provide context on-demand without needing to separate out all the pieces of the code into little silos that are hard to piece together.


Self-documenting code is a total myth. Instead of the author documenting what a piece of code does, why it does it that way and how to use it everyone who works on that code has to spend hours looking through the commit history, merge requests, design docs, wiki history, asking around etc piecing together the history and conversations around the code.

Sometimes these artifacts don’t exist so then you have to go do the same research the author did to write the code just to figure out why the code exists...

Note that none of those activities are reading the code itself because you can do that in both scenarios.


Agreed

I like the view I picked up somewhere but don't remember from whom: Tests are not there to prove something is working, but to prove something is not working.

They are a fairly cheap way to tell me during development that this ain't going to work no matter what. And if everything compiles and the tests and reviews pass, I think this has a good shot at working out in production, but there still might well be cases and circumstances I haven't or couldn't have considered.

No setup will cover 100%, but with not that much work you can already get quite far. The tricky part, in my opinion, is deciding which kinds of tests, processes and automations are actually worth implementing and maintaining at a given stage of a project.


Glenford Myers in "The Art of Software Testing" said: "Testing is the process of executing a program with the intent of finding errors."

The preface to this is even more enlightening:

"When you test a program, you want to add some value to it. Adding value through testing means raising the quality or reliability of the program. Raising the reliability of the program means finding and removing errors.

Therefore, don’t test a program to show that it works; rather, you should start with the assumption that the program contains errors (a valid assumption for almost any program) and then test the program to find as many of the errors as possible."


I like this quote from Kent Beck:

“ It is impossible to test absolutely everything, without the tests being as complicated and error-prone as the code. It is suicide to test nothing (in this sense of isolated, automatic tests). So, of all the things you can imagine testing, what should you test?

You should test things that might break. If code is so simple that it can't possibly break, and you measure that the code in question doesn't actually break in practice, then you shouldn't write a test for it...

Testing is a bet. The bet pays off when your expectations are violated [when a test that you expect to pass fails, or when a test that you expect to fail passes]... So, if you could, you would only write those tests that pay off. Since you can't know which tests would pay off (if you did, then you would already know and you wouldn't be learning anything), you write tests that might pay off. As you test, you reflect on which kinds of tests tend to pay off and which don't, and you write more of the ones that do pay off, and fewer of the ones that don't.”

Source: https://softwareengineering.stackexchange.com/a/244709


When I’ve gone full TDD on projects, I’m always surprised to find there’s a power law distribution or something on failing tests. Most of the tests I write never catch a bug in the life of the software. Something like 90% of the value of a test suite could be achieved with only about 10% of the tests. Of course, the trick is figuring out which tests are going to repeatedly fail ahead of time. But there’s something interesting down this rabbit hole.

I wonder if it would be worth collecting per-test stats through the life of a project to explore this.
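
One way such per-test stats might be explored, sketched in Python. The input format (one `(test_name, passed)` tuple per test per CI run) and the name `failure_share` are assumptions for illustration, not the output of any particular test runner:

```python
from collections import Counter


def failure_share(runs, top_fraction=0.1):
    """Share of all recorded failures that came from the most-failing tests.

    runs: iterable of (test_name, passed) tuples collected across CI runs.
    top_fraction: fraction of distinct tests to count as the "top" group.
    """
    failures = Counter()
    tests = set()
    for name, passed in runs:
        tests.add(name)
        if not passed:
            failures[name] += 1
    total = sum(failures.values())
    if total == 0:
        return 0.0
    top_n = max(1, round(len(tests) * top_fraction))
    top = sum(count for _, count in failures.most_common(top_n))
    return top / total


# Made-up history: 10 distinct tests, one of which fails far more than the rest.
history = [("test_checkout", False)] * 9 + [("test_login", False)]
history += [(f"test_{i}", True) for i in range(8)]
print(failure_share(history))  # 0.9
```

A number near 1.0 would support the power-law hunch. Note that a failure count alone does not separate real catches from flaky tests, so the raw tally is only a starting point.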


You'd get a lot of the value by just being willing to delete tests once it's clear they're not serving any purpose. Sadly a lot of people would rather see a high test coverage percentage than have an effective test suite.


What is the value of deleting a test once it's written, though?


Tests impose a maintenance burden like any other lines of code. Just making it harder to navigate to relevant code (including useful tests) is a significant cost.


You’ve already paid the cost for writing the test, might as well keep it.

If a test fails when modifying the codebase or adding a feature or updating dependencies, it still serves a purpose in my opinion even if fixing it takes some time.

Having the test no matter how trivial also helps boost confidence that a new change didn’t break anything no matter how trivial.

But tests should be refactored once in a while, and if a thing is complicated to setup for testing, either the thing to test should be refactored before shipping and/or the developer working on that thing should also ship a factory and helper functions for testing that thing.


> If a test fails when modifying the codebase or adding a feature or updating dependencies, it still serves a purpose in my opinion even if fixing it takes some time.

> Having the test no matter how trivial also helps boost confidence that a new change didn’t break anything no matter how trivial.

If it catches an error in your change, that's a benefit. If it delays a correct change with false positives, that's a cost. (And if a passing test "boosts confidence" but the change is actually broken, that's another, more subtle cost). Often the cost is larger than the benefit.

> But tests should be refactored once in a while, and if a thing is complicated to setup for testing, either the thing to test should be refactored before shipping and/or the developer working on that thing should also ship a factory and helper functions for testing that thing.

That's all valid assuming the tests are providing some value. But just like with any code, the first question should be whether it's serving a useful purpose at all - if not, then deleting is cheaper and more effective than any amount of careful refactoring.


It seems likely that some of the comments here are reacting to the headline as opposed to reading the article - which is one of my favourite pieces of writing on the subject of observability and responsible maintenance of large scale systems.

Lots of great quotable bits in here. Here's just one for people who didn't make it to the bottom:

"There’s a lot of daylight between just throwing your code over the wall and waiting to get paged and having alert eyes on your code as it’s shipped, watching your instrumentation, and actively flexing the new features."


What would be a more accurate and neutral title?


Maybe "Once you deploy, you aren’t testing code anymore, you’re testing systems"? Trying to lift something out of the article instead of writing a new title.


I saw that too, but I think it's a bit too obscure as a title.


We all test in prod


Ok, there's actually a phrase from the article that says that, so we'll put that up there and see what happens.


For us curious latecomers, what was the original title?


"I test in prod (2019)"


Apropos of this in the context of hiring, one of the biggest green flags I'll attach to someone's CV is any combination of infrastructure/operations, and application development, at the same job.

Since the 1990s I've worked around ISP, hosting, and cloud firms. Many have a core of general purpose people that can't help themselves but have one foot in both graves (we call this DevOps or SRE now, but those are new labels for a long-standing viewpoint). This often correlates to someone who embodies the mindset in the article, viz. that they will gladly and actively defend any ditch we dig together. They always get on my interview list (the green flag). Very often the interviews are a wide-ranging, free-flowing, and in-depth discussion of multilateral & cross-functional technology/process interactions.

Many of the people I've met with this combination will progress, either immediately or eventually, to become very effective CTOs, tech co-founders, or the highest levels of IC at larger tech firms.

Interestingly, and I say this purely anecdotally because I am not actually qualified to make the diagnosis, some of them also appear to me to have an attentional difference, or present from an unprivileged background, and may not have followed a standard educational path as a result of either. Which is to say that I usually delete "must have bachelors degree" from any JD that HR ask me to authorize.


Wow, you just described me. Except I'm about 4 years into my career. Glad to hear that I have a type. It is oddly reassuring in a manner that I can't currently articulate.

Most places I apply I just don't hear back from. Places I have worked had no idea how to use me effectively. I've applied for very few jobs because of this (on the spectrum, have trouble with rejection). I do a lot of free or underpaid work or just do something that fascinates me. The poverty sucks but I love that I get to pursue so many interesting things.

I have faith that I will find somewhere that is a good fit one day. In the meantime, I get to explore my interests and grow. Thanks for this post, it gave my hope a much needed boost.


Your self-description aligns very much with my own. Feel like we might get along well. Would love to chat. Shoot me a PM if you're willing.

Cheers, mate. There definitely is a place for people like us in the world.


>Nobody invests in their “test in prod” tooling.

Firstly, what is logging then?

How is this not tooling to ensure things are running smoothly? `less +F` anyone?

Secondly, if you're running an aws/azure/gcp based server you now have a ridiculous amount of tooling for production diag, analytics and tracing.


"Firstly, what is logging then?"

"Secondly, if you're running an aws/azure/gcp based server you now have a ridiculous amount of tooling for production diag, analytics and tracing."

This presumes that you, the person expected to fix problem X, actually have access to logs and the servers where problems are happening. I've been at multiple projects/companies where this simply Isn't Allowed(tm).

"You wrote this code, you need to fix it! It's broken!"

"Let me get on the server and take a look at the logs to see what's going on".

"That violates our security policy! You can't do that!"

I was tasked with 'finding a problem' and was told to look in the logs. We knew what day the problem was, but I couldn't get anyone to confirm if the log files were in UTC or something else (turns out it was something else).

HOWEVER, I never actually got the actual log files. I didn't have access directly (that takes about a week to go through the chain of command to approve), so someone just sent me small snippets from where they thought the problem might be.

So even orgs that tick all their checkboxes of stats/analytics/logging... sometimes seem to forget that access to the collected info is a requirement too.


Feel your pain <3 You can tell it's been a few years now but I have worked at Big Corp Inc. before.

I know what it's like to ask someone in charge of 8k people to personally approve my server permission escalation even though he's never met me, doesn't know anything about the IT systems, has never visited the site I work at or probably even been in the country I reside.

Anyway, a good thing on this front now is that (speaking from Azure exp.), you can dump logs directly to blob storage (s3, whatever google call their data storage). Then you only need that permission.

As for the non-UTC servers... -_- but if it's a problem, you can always append the tz info to the log date format.
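
For what it's worth, a minimal sketch of what "append the tz info" could look like with Python's standard `logging` module. The class name `OffsetFormatter` and the logger name are made up for the example; `formatTime` is the standard hook that `logging.Formatter` calls to render `%(asctime)s`:

```python
import logging
from datetime import datetime, timezone


class OffsetFormatter(logging.Formatter):
    """Formatter whose timestamps always carry an explicit UTC offset."""

    def formatTime(self, record, datefmt=None):
        # record.created is seconds since the epoch; astimezone() with no
        # argument converts to the machine's local zone and keeps the offset.
        stamp = datetime.fromtimestamp(record.created, tz=timezone.utc).astimezone()
        return stamp.isoformat(timespec="seconds")


handler = logging.StreamHandler()
handler.setFormatter(OffsetFormatter("%(asctime)s %(levelname)s %(message)s"))
log = logging.getLogger("tz_demo")  # hypothetical logger name
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("payment batch started")
```

On a machine set to UK summer time this would produce a line starting with something like `2020-07-19T14:03:22+01:00`, so whoever reads the snippet later doesn't have to ask which zone it was written in.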


The problem that the above user had was that the logs were already written without tz info, so they had no way to know. Assuming it was set to the machine's tz producing the logs, you could probably figure out the tz from the machine's ip address (assuming that's produced in the logs as well..)


What I learned was that the logs are timestamped to whatever machine they're on, wherever it is. The servers in London - GMT. The servers in west coast - Pacific time. Servers in DC - Eastern time. Not ideal.


Right - assuming the server IP, or at least some other network identifying factor, is in the log, you could write some sort of regex to parse the logs and identify the correct time. Of course, that depends on someone being able to actually parse the logs with your regex
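
A rough sketch of that approach in Python, assuming the hostname (rather than the IP) appears in each line and that its prefix identifies the site. The line layout, the hostnames and the `HOST_ZONES` mapping are all invented for the example:

```python
import re
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+; needs the system tz database or the tzdata package

# Hypothetical mapping from a hostname prefix to the zone that site logs in.
HOST_ZONES = {
    "lon": "Europe/London",
    "sfo": "America/Los_Angeles",
    "dca": "America/New_York",
}

# Assumed line layout: "<YYYY-MM-DD HH:MM:SS> <host> <message>".
LINE = re.compile(
    r"^(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>(?P<site>[a-z]{3})-\S+) (?P<msg>.*)$"
)


def to_utc(line):
    """Return (utc_datetime, host, message) for one zone-less log line."""
    match = LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognised log line: {line!r}")
    zone = ZoneInfo(HOST_ZONES[match["site"]])
    local = datetime.strptime(match["stamp"], "%Y-%m-%d %H:%M:%S").replace(tzinfo=zone)
    return local.astimezone(timezone.utc), match["host"], match["msg"]


for raw in (
    "2020-07-19 17:00:00 lon-web-02 upstream timeout",
    "2020-07-19 09:00:00 sfo-web-01 upstream timeout",
    "2020-07-19 12:00:00 dca-web-07 upstream timeout",
):
    when, host, _ = to_utc(raw)
    print(when.isoformat(), host)
```

All three sample lines resolve to the same instant, 16:00 UTC, which is the kind of correlation that is invisible in the raw files. Timestamps inside the repeated hour when clocks go back stay ambiguous no matter what, which is one more argument for logging the offset in the first place.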


I once found a solution to determining the timezone when it wasn't available - a field with only a date was unnecessarily and incorrectly converted to UT, meaning that the zone could be derived from the offset from midnight and applied to other fields.

If enough things are screwed up, you can possibly find the solution to one issue in another.
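
The comment doesn't say how the derivation worked, so this is only one plausible reading, sketched in Python: a date-only value was treated as local midnight and converted to UTC, so its distance from UTC midnight reveals the source offset. The function names are hypothetical, and the sketch assumes offsets between UTC-12 and just under UTC+12:

```python
from datetime import datetime, timedelta, timezone


def offset_from_shifted_midnight(converted_utc):
    """Infer the source UTC offset from a local midnight stored as UTC."""
    since_midnight = timedelta(
        hours=converted_utc.hour,
        minutes=converted_utc.minute,
        seconds=converted_utc.second,
    )
    if since_midnight <= timedelta(hours=12):
        # Local midnight landed later in the UTC day: the zone is behind UTC.
        return -since_midnight
    # Local midnight landed on the previous UTC day: the zone is ahead of UTC.
    return timedelta(hours=24) - since_midnight


def localize(naive_local, offset):
    """Attach the inferred offset to a zone-less timestamp from another field."""
    return naive_local.replace(tzinfo=timezone(offset))


# A date-only field that should have read 2020-03-05 but came back as 08:00 UTC.
mangled_date = datetime(2020, 3, 5, 8, 0, tzinfo=timezone.utc)
offset = offset_from_shifted_midnight(mangled_date)
print(offset == timedelta(hours=-8))  # True

# Apply the inferred offset to a zone-less timestamp elsewhere in the record.
event = localize(datetime(2020, 3, 5, 14, 30), offset)
print(event.astimezone(timezone.utc).isoformat())  # 2020-03-05T22:30:00+00:00
```

The inferred offset only holds for records written under the same daylight-saving regime as the mangled date field, so it would need to be derived per record rather than once per file.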


Logging is tooling but it is the bare minimum and requires very little investment to do. Good logging requires a little more investment but often isn’t done. However, it still generally only provides insight into the known knowns and sometimes a known unknown, which is largely not where bugs happen. Better tooling would enable understanding and further aid reproducibility over mere observability.


Would a better word be "validating" in production?

I try to validate all of my changes in production, see them run, see them output expected values, etc.

I try very hard to test everything I can prior to production; but, then I read loglines, watch graphs or run the code myself, all to validate that all the interconnected pieces are working together.


Most places that I've worked, that "test in prod", do so for one reason. It all comes down to laziness.

The first scenario is they have never set up a test environment in the first place. They're either too lazy to do so, or too lazy to look into how to do it. Often confused with being 'too busy to do it'.

The second scenario is that they do have a testing environment-however, for some reason, it's broken. Some change was made, usually in a hurry to meet a deadline, and it's been broken ever since. This one is usually a result of the fix being too difficult, because the test environment was haphazardly thrown together in the first place.


> The first scenario is they have never set up a test environment in the first place. They're either too lazy to do so, or too lazy to look into how to do it. Often confused with being 'too busy to do it'.

One lesson of modern architectures (i.e. anything more recent than the LiveJournal-style Web/App/DB 3-tier stack) is that it is literally impossible to create and maintain a test environment that has enough similarity to prod to be useful.


As QA, I think testing, test environments, and test resources should be first-class concepts in software. Building whatever "modern" architecture twice and simulating traffic is not easy but possible, at least to an extent. You need to get that extra license for any piece of software for testing during procurement, and provide a means to create test resources, typically test users, as needed. Sounds trivial, but it is often so complicated (banks, insurers...) that it should have been considered early during design.

So, I wouldn't use the word "lazy", but we could do better.


It's not the architecture, it's the representing all the states in prod in testing.

Not saying it's impossible, just not always feasible to have it actively "perfect".

I've found that an "ok" testing environment and solid unit testing / monitoring tends to do better.

And then b/g deploys and canaries :/


Modern architectures look like this: https://github.com/donnemartin/system-design-primer

Building this out twice is either actually impossible, or so cost prohibitive as to be practically infeasible.


> Modern architectures look like this:

A lot of modern architecture is not like that.

And even the diagram there is something that should be possible to replicate if the organization values it.

Today it doesn't even need to be too expensive, as it can be deployed with terraform and torn down an hour later when the full system tests are finished, and then you go back to running the integration tests.


It's not possible to replicate the CDN, or DNS, or the message queue if it's hosted, or the database if it's big enough, or etc. etc. The differences between what you're able to create in a staging environment and what exists in prod are significant enough that testing against the staging environment doesn't build much more confidence than testing with mocks.

Testing in prod means accepting this reality, avoiding the largely unproductive toil of building and maintaining a staging env, and making the production system observable, resilient, and operationally agile enough that you can deploy changes that can trigger unpredictable and emergent system behaviors while managing risk appropriately.


Not sure where you are working but not a lot of people work at/on a huge code base.

Most codebases are built and operated by 1-3 teams and if you are not able to reproduce your prod env as a test env, your architecture is broken.


The size of the codebase is orthogonal to the size/complexity of the production environment.

Modern architectures look like this: https://github.com/donnemartin/system-design-primer

It is generally not feasible to recreate all of the elements of this design in a totally separate testing environment.


I know this very well and I'm still not seeing an issue.

We have 2 GCP accounts. One for production and one for test. Same Setup, not an issue due to terraform and the cost is reasonable for a test env.


Same network rules and configuration? Same CDN settings? Same DNS? Same data in the databases? Same volume of traffic on the queues?

(Rhetorical questions. Of course the answer is no.)

All of these things have huge impact on the actual similarity of the systems, and consequently how they behave.


Obviously but this is not a black and white thing.

If it didn't make sense for us to have test and prod, we wouldn't do it, but I can't imagine a scenario where this is NOT beneficial.


To be useful? That seems to be an extraordinarily low bar to hurdle for a QA environment. I imagine most clear that bar easily. I know mine does.


Can you explain that more? Aren’t most of the complex modern tools specifically designed to automate provisioning servers and deploying code?


Modern architectures look like this: https://github.com/donnemartin/system-design-primer

All of the pieces have complex and emergent behaviors as they interact with each other. Many (most?) are hosted, not even fully in control of the team using them. Recreating this environment in a hermetic way for testing is either actually impossible, or so cost prohibitive as to be practically infeasible.


I'm not the person you're responding to, but having the same software and hardware is possible (most of the time), but production environment is a living thing. To simulate production you need the same data and interaction you have in production, and that's quite difficult to do in many systems, if only for regulatory reasons (PII, PCI and friends).


Once upon a time I was a tester. Now I am a sys admin/DevOps engineer (what is a name, really?).

You would not believe the number of bugs I would find in both vendor code and internal code. Bugs that require dynamic execution and context-specific scenarios to reveal themselves. Then you had misinterpretations of requirements, which is not something the developers would find, considering they wrote the code and believed it met the requirements.

I worked for the police force of my region as a software tester for all their critical systems. I once developed a test tool that created packets to mimic mobile phone calls to police dispatch. I tested every combination of the data spec, including the very last one, where there was a critical bug. The vendor code had implemented the earlier version of the spec, not the latest. The vendor's test tool created the wrong packets as it targeted the wrong spec. Mine caught the bug because it created the correct packets.

As much as I agree with the author's ideas of familiarity with Production, proper unbiased and independent testing is still something this industry needs, even if it's unfashionable.


Most large companies encourage testing in prod. It's called a split test. Clearly the precursor to that is staging, and the precursor to that is integration tests and whatnot. But none of that catches what you can with an A/B test and great monitoring.


Exactly what I was thinking, but I think that a/b test "feature" is mostly framed (and thus used?) for BI purposes as opposed to "let's see if this breaks".

Don't get me wrong, where I work they definitely are open to making mistakes in production. We roll out "risky" new features with split tests and I think we use that logic well so this notion obviously exists, but I'm not sure how widespread it is.


I have seen Google and Facebook first hand. The practice is widespread, but the acceptance of "let's test it in prod" as a real stage in software development varies. Some people look at you like you are a crazy person who should not be trusted, some people acknowledge and know it to be true.

Feature testing is one thing, but you can go a/b testing a whole binary between new and old version to catch interesting bugs too.
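
The routing behind a split test or a whole-binary A/B can be sketched in a few lines of Python. This is a generic hash-bucketing sketch, not a description of how Google or Facebook do it; `in_canary`, the experiment name and the user IDs are hypothetical:

```python
import hashlib


def in_canary(user_id, experiment, percent):
    """Deterministically place `percent` percent of users in the new version.

    Hashing the experiment name together with the user ID keeps each user in
    the same arm for the whole experiment, while different experiments get
    independent splits.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


# Route roughly 5% of traffic to the new binary and the rest to the old one.
arm = "new" if in_canary("user-42", "new-binary", 5) else "old"
print(arm)
```

Because the assignment is a pure function of its inputs, turning the percentage down to 0 acts as an immediate rollback, and error rates can be compared per arm in whatever monitoring is already in place.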


Do you mean acceptance at FB/G? Or acceptance at companies with less mature engineering?


Both. Big companies are not homogeneous, so the practices also vary among orgs.


Well no, but actually yes.

Kind of like how you don't get requirements from talking to customers or "the business". You get requirements from demoing. If your first demo is your prod deploy, then guess what, you've just deployed a proof of concept. It's a de facto thing that just happens even if you don't plan on it. In fact it happens BECAUSE you don't plan on it.

More to this point: there are many different kinds of tests and what we are most commonly referring to is "regression tests". There are lots of bugs that are not due to regressions.


"You get requirements from demoing."

I really like that - what an insightful observation, neatly condensed into a few words.


Headline is unfortunate, and then the first part of the essay is spent justifying/walking back the title. The good part starts a few paragraphs down, at "We conduct experiments in risk management every single day".


A lot of this boils down to the constant churn in frameworks, instrumentation, programming languages. Developers don't have time to master one way of doing things, it's just a constant lava layer of crap.

I took the time to learn how to build and maintain Ruby on Rails systems, hoping it would be the ticket to a fun, manageable career. Any project I worked on was an island of sustainable, fast development where the team finished the sprint's work in a few days. I knew where the bottlenecks were should the project need to scale, and scaling issues never took down the site completely.

Only to throw all that expertise away when it was just decided to replatform one day. Because I guess stable just wasn't good enough. After it happened twice, well, might as well go into devops. Y'all are gonna need it.


I had the same issue with Rails as I have with Node: the churn is detrimental to long-standing projects. Most projects I do run for years and even decades; try running a security update on Rails code that is 10 years old. It is a nightmare. I did not have issues like that with PHP, ASP.NET or Spring.


I loathe NodeJS with the passion of a thousand suns. When Rails projects churn, at least you can rely on the stability of Ruby. With Node, the number of times I had to work around inconsistencies with the framework and even language were maddening. Babel is a horrible horrible thing, and Javascript was never meant to run on a server.

Maybe it's better now. But I won't ever do serious work again with the stack so I'll never find out.


Curiously enough JavaScript was in fact designed to run on the server too, Netscape LiveWire shipped it in 1995.


Node is getting better each year imo. I've been pretty happy with ES6.

What makes you say Babel is horrible? Not disagreeing but I haven't heard anyone say that before.


I write JavaScript daily and my gripe with Babel (and transpilers in general) in the JS ecosystem is that it makes it easy to have 10 dependencies written in 10 similar-but-different languages. If you have to edit a dependency then you're stuck editing their source file (written in ES6-ES11 plus random unaccepted proposals / TypeScript / Coffeescript / Purescript / etc), installing their devDeps tree to run their build process, integrating their sourcemaps to un-break your stack traces, and writing code in a slightly different language each time you make a change. It's exhausting.

It's really neat that we have the option to transpile, and I was actively working on transpilers in the Meteor ecosystem when they were all the rage, but at this point I try to avoid them whenever I can.


People don't say it directly, but all the constant complaining about the bloat and complexity of the JS ecosystem and the size of node_modules folder boils down mostly to two tools: Babel and Webpack.


When people make fun of leftpad they are not talking about babel and webpack.

The bloat is a lot about the lack of a standard library in JS. Or rather, too many of the damn things. That and an inconsistent and unpredictable type system.


> The bloat is a lot about the lack of a standard library in JS.

Which is why someone created core-js, which a bunch of libraries use but, in my experience, never update. Every React project will have multiple libraries complaining that core-js 2 is deprecated.

So basically even the "fix" has all those problems.

> That and an inconsistent and unpredictable type system.

Agreed and honestly, I don't think TypeScript's solution for third party types is any better, either. I recently ended up in a situation where `some-library` and `@types/some-library` got out of sync and I simply had to try versions of `@types/some-library` until it would compile again. This happened because typedef versions are not pinned to library version in any way, they are versioned just like any other library, and to the best of my knowledge you can't simply look up this information to figure out which @types release you need for a library. You can only enumerate released versions and hope the latest of each work fine. Was there a breaking change that the typedefs haven't accounted for yet? Well guess what, you can't update that library, yet, even if there were security fixes. Your code won't compile.

This means you could potentially have `some-library` at version 1.2.1 and `@types/some-library` at version 1.1.0. There's no relationship there. This happens the moment there's a revision in a library that doesn't require a revision in the typedefs.
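
To make that kind of drift visible, here is a minimal sketch (not a real tool) that scans a dependency map and flags libraries whose `@types` package has a different major.minor version. It assumes the DefinitelyTyped convention that the major.minor of a typings package is meant to track the library, which is an assumption here rather than something this thread confirms. The names `find_type_drift` and `major_minor` are hypothetical, and scoped packages are ignored.

```python
import re


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from a version or range such as '^1.2.1'."""
    match = re.search(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def find_type_drift(dependencies: dict[str, str]) -> list[str]:
    """Return libraries whose @types package has a different major.minor."""
    drifted = []
    for name, version in dependencies.items():
        if not name.startswith("@types/"):
            continue
        library = name[len("@types/"):]
        if library not in dependencies:
            continue  # typings installed without the library itself
        if major_minor(dependencies[library]) != major_minor(version):
            drifted.append(library)
    return sorted(drifted)


# Illustrative data only; these package names and versions are made up.
deps = {
    "some-library": "1.2.1",
    "@types/some-library": "1.1.0",
    "other-library": "3.0.4",
    "@types/other-library": "3.0.1",
}
print(find_type_drift(deps))  # ['some-library']
```

A mismatch reported this way is only a hint to look closer, not proof that the typings are wrong.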

Sorry to ramble, I'm just really unhappy with the state of our two major ES languages.


Babel was the largest package that was dependent on leftpad back in 2015 when the leftpad issue happened.

The package bloat issue doesn't happen with Typescript, for instance, which occupies a similar niche in the ecosystem (it's also a transpiler), or prettier.


But that is a sign, is it not? If tools as badly designed as webpack are the standard, is that not saying there is an issue with the entire ecosystem?


There are other standard tools that don't suffer from package bloat, like Prettier and TypeScript, which don't have dependencies. This issue is not inherent to the ecosystem.


I don't think it's the tools' fault. They were built to solve problems, and they solve those problems quite well IMO.


Sure, but I don't think those things are related. It's possible to solve problems without causing package bloat. Typescript and Prettier have no dependencies, for instance.


This is the primary reason why I love the .NET stack so much. Not only is a security update almost universally a non-event, but so are major framework revisions.

Updating .NET applications for security concerns in many cases boils down to letting windows update do the thing it should already be doing automatically. In a much smaller portion of cases for .NET Core, a security update is a matter of rebuilding a self-contained deployment and pushing it out to production. Neither of these cases requires editing any code or configuration.

Only in the transition from .NET Framework to .NET Core did we find any appreciable difficulty, and even then it was still within the realm of reasonable. Most of the pain here boiled down to usages of System.Drawing and DirectoryServices, both of which have alternatives that are supported on all platforms. We opted to use the compatibility pack instead, but would be OK if that were removed at some point and we had to switch to the other implementations.

We have lots of source code that has withstood the test of time because of these technology choices. Many files are completely unchanged since .NET Framework 4.5 was released and are now running flawlessly on top of .NET Core 3.1. This is the only reason we are still in business right now.

I ended my journey with NodeJS (and AngularJS on the same day) when I attempted to update all of my packages after just a month of inactivity. Seeing the resulting pile of nonsensical dependency graph trash in my console may have radicalized me into using more "enterprisey" solutions when writing software that I might get put on pager duty for.


My experience has been the opposite. The Rails ecosystem makes updates straightforward and is very well documented. Been running various Rails apps for almost a decade now.


It's pretty good for incremental updates. It only asks a few hours a year to manage those.

Not so good when you want to upgrade a 5-year-old system. I inherited a reasonably well-built Rails app deployed with Ruby 2.2.1 and would like to update it, if only for security reasons. I can't even run that on a recent Ubuntu. On a system where I can run it, I run into dependency hell with the gems - need to upgrade some, pin lower versions of others. Possibly that could have been handled with pinned dependency versions the first time round, but that breaks the regular upgrade process.


That is indeed what I meant; I had to create debootstraps to even get it running on my laptop. And many of the gem updates broke the APIs completely for no reason (move fast and don’t care? I do not know). I remember a security update where the new version renamed most of the API methods while there was no reason to do so. I understand you can add new methods and keep backward compat; here they actually renamed functions, so it was not backwards compatible and you had to change everything.

Simply put; we can cry about java or c# or php (raw php: not laravel; it has the same issues) but I can run 15 year old code on my current Ubuntu and it works fine. Maybe I need to change a few things but I do not have to rewrite a lot of it; it is more find/replace than anything. The feeling that software has to work in 50 years is important. Incremental updates are expensive and not needed for most companies, so software needs to support major version updates over decades. We know that now...


Straightforward and documented is the minimum. Not having to think about it at all ever would be ideal. At some (huge) scale it will eventually become a thing that takes a year and a half and gets blog posts written about it, and I never enjoy losing a weekend to a microcosm of that: https://github.blog/2018-09-28-upgrading-github-from-rails-3...


> After it happened twice, well, might as well go into devops.

My observation of the devops world is that the rotation of technologies and frameworks there is faster than in backend development.


Yes, but there at least is some genuine innovation going on. The backend frameworks are just churn for solving the same boring and lucrative business problems in new, fashionable ways.


True, but imho it's somewhat of a fan-out or at the very least a knock-on effect. The explosion in devops complexity is a necessary byproduct of the explosion of projects that adopt new languages, runtimes, datastores, etc. No one ever fully deletes and retires the old system, so the overall environmental complexity grows exponentially. The tools it takes to solve those exponentially larger problems (k8s) look ridiculous when you consider them in the context of "a rails app with some node stuff". But they make perfect sense when you say "every quarter or two, for a decade, a dev team has adopted a new component or two into the stack, fewer than three of which have ever been deleted."


If you like Ruby on Rails so much, why don’t you find a job programming Ruby on Rails? There are so many.


My experience is that nobody actually does things The Rails Way, which is where most of the benefits come in. No, they’ll start out in Rails such that their underlying data model evolves into a huge mess (because Rails guards against most problems in the code layer, so the data model problems don’t really appear as bugs yet), and then some resume-driven developer will push for serverless or microservices to solve scalability problems (when the actual bottleneck is the database and the data model, not Rails), and now you’re left managing this Frankensteinish “rails” SPA beast that could be rewritten as a proper database schema and a 5K LOC Rails app, but there’s too much change control process to actually achieve that.


The biggest waste of time I've ever seen in my career was when a team put all their backend efforts into migrating their app from The Rails Way to Services, then from Services to Trailblazer, then from Trailblazer to ActiveInteraction. All in the span of two years and without delivering a finished product.

In the end the company just bought a competitor and gave up on the app.

100% agree that doing things The Rails Way is where you reap the benefits.


Rails does so much of the plumbing work for you (and does it really well) that it’s a good smell test that if you’re not working on something that the user sees, you’re probably reinventing the wheel.


After it happened twice, I just gave in to the inevitable and decided to value jobs based on other factors.


He mentioned that he had a job doing that, but the company moved away from the framework without technical justification.

He could find another, but there would be no guarantee that the same thing wouldn't happen again.


> the company moved away from the framework without technical justification

I think this alone is enough justification to look for a new place to work, but I’m just a job hopping millennial.


I did. Asked to be moved to a different client. Still there.


> I guess stable just wasn't good enough

I suspect that's meant to be rhetorical, but from a business perspective, was stable good enough? Sometimes what works now won't be competitive in the near future.


Take a news site, for example.

It's just text and pictures on a web page. The New York Times re-invented a database out of Kafka to power theirs (mind-bogglingly stupid).

The Guardian re-wrote their CMS at least 4 times, at a cost of well over 50 million, if not more. That's before they got to re-doing the layout.

The FT spent 6(!) years rebuilding their entire stack (150 tech staff for 6 years, a good percentage of them contractors; > £70 million).

All of it is just fucking text on a fucking web page. All that effort to add fancy gizmos for hosting, optimising for framework x, which has a half life of 6 months, adding animation to drop downs. Just a spectacular waste. They are still re-platforming every 9 months, each of the 5 major teams.

All of that money could have been put into content, advertising for customers, collaborations & events.

Most news sites can get away with a static, highly cached system (I mean, look at the Daily Mail: the page is slow, static and looks like shit, yet it's the biggest newspaper site online).

So from a business perspective, it's spaffing money up the wall for no real gain.


I wish I could upvote this a thousand times. At the end of the day, it's all ego driven development. Nobody wants to think that their job has already been solved, and what's left is mundane and boring [engineering wise]. And so the business ends up with using a distributed log to handle a data set that could fit on a big thumb drive.


can confirm. media companies spent the last decade alternating between redesigns and cms migrations in a circular path that went nowhere while fb & goog ate not just their lunch but their breakfast and dinner too. the vast majority of it was driven by engineering-management-career resume-building and was actively detrimental to the editorial/content-production/journalism side of the house.


> [T]he vast majority of it was driven by engineering-management-career resume-building and was actively detrimental to the editorial/content-production/journalism side of the house.

Why didn't the non-technology side prevent this if it was so clearly detrimental? I can't imagine that the technology side of a media concern has that much influence over the company's overall priorities.


Because editorial say: "we want the website to do x" and the tech team say "OK, we'll need 100 staff and six months."

Then, after delivery, the tech team say: "we need 50% more staff to do this thing, it's really good, it'll save 25% in our hosting costs."

Editorial then say: "sure, deliver this feature as well"

This continues, and then deadlines are missed. So tech say: "We have so much tech debt we need to re-architect" which allows ego/CV driven design.


I've tried to reply to this three times and it's hard to do without writing a book.

The shortest possible answer is that investors/owners/boards were watching what happened to sites like reddit/huffpo/bi/instagram/tumblr and salivating out of control at the word "billion". They believed that if they starved their editorial operations and bet most of the cashflow on growing their tech teams, "user generated content" would make up the difference. It was very simple math that said "media companies are valued at 10x profits, tech companies are valued at 10x revenue, let's do whatever we can to give the appearance of being a tech company, not a media one".

Within those companies' tech departments it became open season for CTOs, VPEngs, DevOps Directors and sr/lead/architect devs to engage in a peer-to-peer DDoS of architectural and tech-fad-chasing one-upmanship. Because the only penalty for it all turning out to be too much was having to grow your headcount.

In 2016/2017 it all finally collapsed. Most of the companies have faded to layoffs and/or acquisitions. Many have basically frozen and given up on their tech stacks in-place. A large number have abandoned ship to wordpress (where imho they should have been all along).

Sometimes I wonder what the general state of "the media" or "journalism" would be now if 80% of that money had gone to hiring content creators, or even just paying the same number of content creators the 30% more it would take for it to be an adult job, not an early-20-something job. An awful lot of the salaciousness and outrage-stoking in media right now is a byproduct of the actual jobs being done by very young people sitting in place all day, garnering all their information from social media feeds and trying to hit their quota of 2 - 5 posts (sorry, 'articles') a day. We all think we're critiquing journalistic institutions when we point it out, but really we're just yelling at kids. We blame them for not being able to achieve the standards we're used to with 3 hours to deadline in their permalance gig, when what we got used to was written by mid-career people with benefits who had all day and sometimes all week.

Media isn't stoking outrage culture, it's clinging to it as the last source of pennies left before the nothing.


> Sometimes what works now won't be competitive in the near future.

What do you mean by that? I can understand that in terms of hiring devs this is true, but I would be curious about any other reason. Because I work with/in many companies, I see a lot of different tech, and competitive is usually just what the team knows best. So unless you are making a new team, I cannot fathom why you would switch.

Java with Servlets/Spring and C# with ASP.NET have been competitive for almost 2 decades; maybe some people can shave off a few minutes here and there, but in the grand scheme of a serious company with a serious project (500k loc+) it is not going to matter. There are exceptions, such as machine learning, but those gaps close and this was about web frameworks.


Perhaps he means that you won’t be able to attract great talent. A lot of extremely talented engineers don’t want to work on code bases and technologies that are 15 years old.


Definitely would like a source for this because I feel like I hear it a lot and I don't believe it's true.

You can be talented and not care what you work on. And if all you care about is what tech is new and shiny, how talented can you be? I don't want you rewriting my product every year so you can use the next framework du jour.

Maybe talent means something else when nothing is at stake, but I'd argue that a talented engineer knows that when a product works and makes money, and the tech stack isn't prohibitive of those goals, it doesn't really matter what it's written in. In fact, I see trend-chasing developers as liabilities, because they're the ones who will replace everything every six months while you bleed money.


The issue is rarely the age of the codebase or the tech stack, but rather the maintainability and quality of the code.

Due to many factors, the age of the codebase is normally inversely proportional to its code quality. Especially codebases in fields where trends change fast, like web technology.

Therefore, if you're a talented engineer able to work in multiple technologies you'll certainly prefer to work on something that has less chances of making you want to tear your hair out.


> the age of the codebase is normally inversely proportional to its code quality.

What??? You are saying that simply because a code base is old, it is of poor quality. There are numerous public examples contradicting this: Linux, many Apache projects such as httpd, etc.

I emphatically disagree and would like to know why you think that.


I said "normally". I don't think anyone considers Linux or httpd "old tech", nor are they "legacy software" that experienced developers are running away from. C might be old as a language, but there are still modern things being built with it. I also said that the issue is not the age itself or the tech stack.


You didn't use the word legacy in your post. You used the word old. Quite different. Something can be old and still maintained (httpd, etc).


> they're the ones who will replace everything every six months while you bleed money.

Yep. I agree. Maybe talent is the wrong word, but my point is that a lot of experienced engineers want to learn new things by working on new things.

There is a difference between talent and wisdom.

My source? Decades of experience in the industry working with hundreds of different engineers. So, anecdotal.

Not everything needs a double-blind research study to be truthful.


both java and .net (core) are evolving and more mature, though


I love articles like this because it's so easy to just add that company to a list of places to never ever work.

I did read the whole article, btw. It's an absolute clickbait title that the author doesn't really mean, and after the article spends a lot of time defusing the clickbait title it really boils down to, "This is hard, so I give up."

It's true that many--if not most--companies operate this way without ever acknowledging it. And that's bad. It's also true that systems are harder to test than code. But it's not deep fucking magic. Look at the work aphyr does. Look at the testing work that the FoundationDB team did to prove their system's guarantees. Look at the work that security and devops people do every day. She is right that it is hard to test systems. So what? We don't get paid as much as we do because it's easy.

In a certain environment, it is truly impossible to test a system. That's when you have a dev culture that refuses to actually design knowable systems. A much better approach for the article would be to address exactly why systems are so hard to test rather than just saying fuck it. Everything she cites in her list of things that are hard to test is absolutely testable, if you have a knowable system. The real problem here is that agile/scrum/Xtreme programming practices inevitably and by principle do not result in knowable, testable systems. When you have 30+ agile teams on their own sprint cycles and product managers leaning on them to ship features and figure the rest out later, there can be no other result than a fragile, broken, unknowable, untestable system.

But the answer to that isn't "Everybody else is doing it so why can't I." The answer isn't to "embrace it." The answer isn't "This is hard, fuck it." The answer is most definitely not to make individual engineers pay the price of being on call because a company's culture and process are totally and completely hosed.

The answer is to address the problems in your company that caused this situation in the first place. The answer is to get your head out of the feature cult and the velocity war and reset your priorities. Systems aren't hard because your engineers suck. They're hard because companies suck. Systems are hard because in most places, no one is allowed to spend more than a couple minutes thinking about the systems.

Agile culture after your early startup cycle is a lot like being a 40 year old guy who's 30 lbs overweight. How did this happen? How did I get here? I was just taking life one thing at a time and getting shit done. Now nothing works quite as well as it used to, it's harder to find dates, and everything just sort of hurts. Would anyone in their right mind just say, "Embrace it! Most 40 year old tech dudes look about like you and are in the same situation! It's fine!" No. Of course not. You have to realize that your priorities have been totally broken for the last 15-20 years of your life, that you really weren't getting shit done, and you have to take some responsibility for your diet and get off your ass and exercise.

That's what companies have to do. They won't, of course. But they have to, otherwise they'll die young deaths. This article is totally correct when she recognizes a terrible symptom of unhealthy companies. But her treatment is hopelessly and tragically wrong.


This works great if you're building something with a tightly controlled API.

If, however, your configuration space grows to an even middling size, it no longer becomes feasible to do much of this validation across the configuration space. A good example is any system where the user can customize system aspects. Do you run all of your integration tests across the full configuration space?
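
As a rough illustration of how quickly that space grows, here is a small sketch. The option names and values are made up for the example and are not taken from any real system.

```python
from itertools import product
from math import prod

# Hypothetical options a user could customize; names and values are illustrative.
options = {
    "auth": ["password", "sso", "ldap"],
    "storage": ["local", "s3"],
    "locale": ["en", "de", "ja", "pt"],
    "plan": ["free", "team", "enterprise"],
}

# The number of distinct configurations is the product of the option counts.
total = prod(len(values) for values in options.values())

# Enumerating them explicitly gives the same number.
combos = list(product(*options.values()))
assert len(combos) == total

# Ten independent boolean flags on top multiply the space by 2**10.
with_flags = total * 2 ** 10

print(total)       # 72
print(with_flags)  # 73728
```

Four modest options already give 72 combinations, and every additional independent flag doubles whatever number of integration-test runs you were planning.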

Additionally, managing configuration skew between a dev and prod environment is not simple. Simply claiming that there should be no skew doesn't work. Often you want the prod and dev environments to run as different users, and you certainly want them to have different ACLs (your dev environment should not have access to your production database).

So you now have to, across your configuration space, validate that only the things that are "supposed" to be different differ, and that the things that aren't don't. Which maybe works for a while, but your prod configuration may also differ across parts of prod if, for example, a change is being canaried or incrementally deployed.
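
A minimal sketch of that check, assuming flat key/value configs and a hand-maintained allowlist of keys that are supposed to differ. The function name `unexpected_skew` and all keys and values below are hypothetical.

```python
def unexpected_skew(dev: dict, prod: dict, allowed: set[str]) -> dict:
    """Return keys whose values differ between dev and prod without being allowed to.

    A key missing from one side is reported with None for that side.
    """
    skew = {}
    for key in sorted(dev.keys() | prod.keys()):
        dev_value = dev.get(key)
        prod_value = prod.get(key)
        if dev_value != prod_value and key not in allowed:
            skew[key] = (dev_value, prod_value)
    return skew


# Illustrative configs; real ones are usually nested and far larger.
dev = {"db_host": "dev-db", "run_as": "svc-dev", "timeout_s": 30, "retries": 3}
prod = {"db_host": "prod-db", "run_as": "svc-prod", "timeout_s": 10, "retries": 3}

# Keys that are "supposed" to differ between the environments.
allowed = {"db_host", "run_as"}

print(unexpected_skew(dev, prod, allowed))  # {'timeout_s': (30, 10)}
```

The function itself is trivial. The cost is in keeping the allowlist correct, and in deciding which prod you compare against while a change is only partially rolled out.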

I've spent a non-trivial amount of effort on trying to solve the one problem of configuration skew between dev and prod for one real system. It's ultimately not worth it. "Fixing" it would cost more than leaving it alone. And I mean that in the long term: the effort to maintain and follow the rules that such a system would impose is greater than the effort of dealing with the annoyances of unintended skews.

Systems are hard because systems are hard. There's no good company that doesn't test/experiment in production. All of them do.


That's a weird point you are making.

Yes, I would test the basic / standard customizations a customer would do.

I would test the customization system itself.

If it is too complex, your customization system will bring you many more issues later on.


I didn't say that any of this was simple or trivial. Again, it depends on your priorities and your values as well as your company culture. In fact, I specifically said that testing systems is hard and provided examples of how hard systems are to test. Do you think that Cassandra is a tightly controlled API with a small configuration space?

You seem to feel like close enough is good enough. And that's the cause of the problem I'm trying to address here. Does it really matter if you don't get a notification when someone messages you on Facebook? Or if you get two notifications? Is that particular problem worth testing every possible Kafka configuration? I think what you are saying is no, it doesn't matter.

But I'm arguing a different point. I'm not arguing about whether the testability of any individual feature is important. For obvious reasons: some features really just aren't that important. But not being able to do that, and actively choosing not to understand that system is a symptom of a far deeper problem. When a company makes the choice you have just described, the company has decided to accept that they can't, won't, and will never fully understand their own systems. It's often not a conscious decision, it's a decision made by habit, policy, and culture, which is what's so subversive about it. People don't make big-picture decisions to intentionally have a system that is unknowable/untestable. People make small decisions just like the ones you are talking about that make systems that way. And it's the practice of letting lots of disconnected people make the small decisions of what does and doesn't matter, what is and isn't worth it that destroys systems.

Systems are hard, and I agree with that, but systems are made even more so by bad process.

The being old analogy didn't seem to resonate with you, which is fine. But let me ask you a question about a system.

You have a database. It gets backed up every night. Or maybe every hour. Your job is to take snapshots and store them because that's what you're supposed to do. Yeah, I know, that should be or can be automated. Whatever.

The big picture system and purpose is that you are supposed to be able to recover from a hardware failure/data loss. But that's not your problem. Your problem is that you have to back up the database manually every day. The data team only tests restoring backups from dev to dev instead of prod to dev. Because reasons. Because it's hard.

That type of backup system checks all the boxes you're supposed to check when you get audited. Or at least enough to get through it. But when you really need to understand the system, it fails for all kinds of reasons and people are sitting around looking at each other saying, "well I did what I was supposed to do."
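
For contrast, a restore drill is the kind of thing that can be automated. The sketch below uses in-memory SQLite databases as a stand-in for "prod" and the restore target. The table, the data, and the `table_fingerprint` helper are all hypothetical, and a real drill would restore an actual production backup and compare something stronger than a row count and an id sum.

```python
import sqlite3


def table_fingerprint(conn: sqlite3.Connection, table: str) -> tuple[int, int]:
    """Return (row count, sum of ids) as a cheap stand-in for a real checksum."""
    count, id_sum = conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM(id), 0) FROM {table}"
    ).fetchone()
    return count, id_sum


# Stand-in for the production database.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
source.executemany(
    "INSERT INTO orders (id, total) VALUES (?, ?)",
    [(1, 9.5), (2, 20.0), (3, 3.25)],
)
source.commit()

# "Backup": copy the source into a separate database.
backup = sqlite3.connect(":memory:")
source.backup(backup)

# "Restore drill": load the backup into a fresh database and verify it.
restored = sqlite3.connect(":memory:")
backup.backup(restored)

restore_ok = table_fingerprint(restored, "orders") == table_fingerprint(source, "orders")
print(restore_ok)  # True
```

The point is not the tooling but that the check exercises the restore path end to end, rather than only confirming that a backup file was written.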

Individuals sitting around making isolated, disconnected decisions like the ones you're talking about (i.e., it just isn't worth it; it's not feasible; it's hard) compound in organizations and create the kinds of systems you don't want to deal with. You're making your own hell here. You seem to have missed that key point in my earlier comment.

Laziness is a good trait in an individual programmer. But laziness is the absolute death of an organization. Agile is really just distributed, organizational laziness. That's what creates horrible, unknowable systems.

Conflating test/experiment with what the original article claimed to be talking about (and then later walked back) is borderline disingenuous. No one is talking about A/B testing or intentional experiments.

The article is talking about rolling the dice in production deployments and claiming that's fine and something to be proud of. It isn't fine, and it's not something to be proud of. She's the CEO. She should fix her company instead of being proud of how bad it is.

A lot of what we're talking about here is a matter of perspective. And that is the problem I'm taking to task both with you and with the article.


> I think that you are saying is no, it doesn't matter.

No I'm not saying that. I'm saying that the best way to prevent that isn't always to have a staging environment that mirrors production as well as you can.

> Individuals sitting around making isolated, disconnected decisions like the ones you're talking about (i.e., it just isn't worth it; it's not feasible; it's hard) compound in organizations and create the kinds of systems you don't want to deal with. You're making your own hell here. You seemed to have missed that key point in my earlier comment.

No, this was an intentional decision by the organization, that the organization shouldn't continue to invest time in solving the problem this way, because after significant effort expended by the organization, the conclusion of the people who the organization asked to investigate the problem was that solutions would not be feasible and would not improve things. You're acting like these decisions are always made in a vacuum. They're not. Often smart organizations investigate and make decisions at the level of leadership.

> Conflating test/experiment with what the original article claimed to be talking about (and then later walked back) is borderline disingenuous. No one is talking about A/B testing or intentional experiments.

Are you sure?

FTA:

> We conduct experiments in risk management every single day, often unconsciously. Every time you decide to merge to master or deploy to prod, you’re taking a risk.

> A healthy culture of experimentation and testing in production pulls together all three.

Canarying is just testing in production, but you have processes and "guardrails" (quoting the article) to make sure that it is done safely by default.
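
As a sketch of what such a guardrail might look like: compare the canary's error rate against the baseline fleet and decide whether to promote, roll back, or keep waiting. The `Stats` type, the thresholds, and the decision names are assumptions for illustration, not anything the article or a particular deployment tool prescribes.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid division by zero when no traffic has been served yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    baseline: Stats,
    canary: Stats,
    max_ratio: float = 1.5,   # hypothetical tolerance relative to baseline
    min_requests: int = 500,  # hypothetical minimum sample before judging
) -> str:
    """Return 'wait', 'rollback', or 'promote' for a canary deployment."""
    if canary.requests < min_requests:
        return "wait"
    if canary.error_rate > baseline.error_rate * max_ratio:
        return "rollback"
    return "promote"


baseline = Stats(requests=100_000, errors=100)  # 0.1% errors

print(canary_decision(baseline, Stats(requests=100, errors=0)))    # wait
print(canary_decision(baseline, Stats(requests=1_000, errors=5)))  # rollback
print(canary_decision(baseline, Stats(requests=1_000, errors=1)))  # promote
```

Real canary analysis usually looks at more than one metric and applies some statistical test, but the shape is the same: production traffic is the test input, and the guardrail bounds the blast radius.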

For the record, I work primarily on reliability and release/experiment, and so I'm well aware that being explicit about your decisions is vital, as is knowing the tradeoffs involved. That's why pretending that you don't test in prod is a bad idea, because you almost assuredly do. That's what the article is saying.

Edit: As for Cassandra, it looks like they have system bugs caught in production, so I'm not sure what your point is (https://issues.apache.org/jira/projects/CASSANDRA/issues/CAS...)


"Engineers should be on call for their own code." - Would you rather work someplace you are expected to be on call 24/7, or a company that doesn't require that?

It isn't the norm, and it isn't competitive. It's just more "always on" culture in the workplace - and that's not healthy. A company should understand workers need real breaks - and being on call is not a real break.


Want to jump in here - I have worked at a company where engineers are not on call for their code, and it was a living nightmare.

_You_ might not be on call for your code, but _somebody_ will be. Often some poor SRE/ops person that has absolutely no idea what the app is doing/or why it's failing in production.

Not being on-call makes engineers complacent. I've seen it all: known memory leaks shipped into production, apps where half the endpoints couldn't even be compiled, code dumping the production redis at 1AM ... and every time the pain just fell on deaf ears.

If your code is what wakes you up in the middle of the night, you have:

- Incentive to fix/mitigate as soon as possible.

- No blame game to play. Either the error was made by you, or someone on your team. It doesn't have to go up 3 rungs on the ladder then back down again.

I don't think the author was suggesting that everyone should always be on call, just that you _must_ be responsible for your own code in production


I'm always happy to help some poor SRE in the middle of the night, and I once even drove to the office on a rainy Sunday, in the middle of my vacation, to access IP-restricted stuff because a support intern messaged me on Instagram.

...but with that said: I'm glad I only worked in countries where work is properly regulated and "on call" means "I'm getting fucking paid every cent for each hour I _must_ answer that goddamn phone". Which in practice means there's no PagerDuty.

The unpaid on-call culture is bullshit. The company can either pay me or go fuck itself.


I unfortunately work in a place where on-call is unpaid. I'm an SRE stuck in the 90s.

The policy states that only the Operations team gets paid on-call, because I guess in the old days they would be the ones expected to deal with production.

Fast forward to today, and the Operations folks are a small team managing 2 datacentres, and all on-call rotations between SREs and developers are considered unofficial and therefore not eligible to be paid.

One of our Sr. Managers tried to take this up the chain, but then got reprimanded for putting developers on-call.


Now apply the same rules to the SRE role as well.


I've found the problem isn't being on call, it's being on call PLUS not having control over priorities.

When doing it right, owning the on call experience can be a valuable learning experience. But this usually means having extra time to do things like develop integration and performance testing environments. And access to make whatever changes needed to happen, happen.

But a lot of places are like "nah, you're being a perfectionist". And then expect you to magically respond to issues with vague descriptions and no diagnostics. And yeah, that sucks.


When you're Oncall you aren't Oncall 24/7 forever. It usually rotates amongst the engineers in the team. I'm on a team with about 10 engineers in it so you're on call about a week every two months. I call that manageable.
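A minimal sketch of the rotation arithmetic above, assuming a simple weekly round-robin. The engineer names, the start date, and the `build_rotation` helper are hypothetical and only for illustration. With 10 engineers, each person's shifts are 10 weeks apart, which is a little over two months.

```python
from datetime import date, timedelta


def build_rotation(engineers, start, weeks):
    """Return (week_start, engineer) pairs, one engineer per week, round-robin."""
    return [
        (start + timedelta(weeks=i), engineers[i % len(engineers)])
        for i in range(weeks)
    ]


# Hypothetical team of 10 and an arbitrary Monday as the first shift.
team = [f"eng{i}" for i in range(1, 11)]
schedule = build_rotation(team, date(2020, 7, 20), weeks=20)

# Collect the shifts of one engineer to see how far apart they fall.
eng1_shifts = [week for week, who in schedule if who == "eng1"]
gap_days = (eng1_shifts[1] - eng1_shifts[0]).days
print(f"eng1 is on call every {gap_days // 7} weeks")
```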

Engineers should 100% be responsible for owning their code, and fixing any issues that arise from it. After all they're the ones that wrote it, aren't they the best people to fix it when it breaks?


I don't imagine Charity (article author) was implying that the Amazon method of being on-call for your code was ideal.

I took it to mean that code ownership is important and you should be responsible for fixing things when your code blows up


I would argue that using the right words is important here. “On call” typically refers to a specific thing, where if something breaks that person on call is, well, called (or paged, etc.) to fix it when it happens, even at 2am. This is how I took the meaning of that paragraph as well.

If the author meant that code ownership is important, or that the engineer(s) who wrote the code are responsible, that message could have been conveyed by saying, “the persons who wrote the code should test it once it’s deployed and are responsible for fixing if it breaks or is broken when deployed.” This is much clearer and doesn’t use terms that could be understood incorrectly.


Being on call sucks, clearly. But, the benefit of engineers being oncall for their code is that it makes a pretty effective feedback loop --- the person who breaks the thing fixes the thing, and learns to break the thing less often, or to break it earlier in the business day so as not to ruin their evening, or to make it run better in degraded modes so it's ok to be broken for longer and alerts can be acknowledged and dealt with later.

I like working in small teams because there's less required communication. Having the oncall be the engineer means the oncall doesn't have to communicate with the engineer, they're always up to date because it's one person (subject to sleep deprivation issues).

It's certainly not good for work/life balance though. Some production issues are unavoidable, automation for the common ones can help.

Edit to add: if you're oncall and your alerts are mostly because your dependencies are bad at their job, and you aren't empowered to do anything about that; having the engineer oncall isn't useful. It's only useful if the engineer is in a place to make changes to reduce future alerts.


And then that one person gets hit by a bus and you go out of business. Very-interconnected large-scale systems rarely have failure modes that are as simple as something the dev did/didn't do.


It seems like about half of the postmortems I've seen (public ones for high profile things and private ones where I've worked) have the incident start either when someone pushed a change, or sometime after the change was pushed when the change blew up; this is why change moratoriums are so effective --- when people stop messing with the system, it becomes stable.

Another large portion is power transfer switches failing. Then you have redundant Cisco products failing to fail over properly, often resulting in 30 seconds to 5 minutes of lost network connectivity and then (if you're reading a postmortem) cascading failures. After that it's one-off partial hardware failures where things worked enough to meet healthchecks but not enough to do actual work (my favorites are things like ECC correcting errors at such a high rate that the system is using 90%+ CPU on servicing machine check exceptions, or a system that somehow booted with 64MB of RAM instead of 4 GB and is running from swap, miraculously).

You can obsess about bus factor, or you can hire people who are good at figuring out complex systems with no documentation and if someone leaves, assign someone with good overall system knowledge to their system until you can find a new dedicated person.


Arguing in favor of more than one person per project is not "obsessing" over bus factor lol. I want to be able to take days off, and I want my coworkers to enjoy the same.

The kind of takeaway I'd want to see from your first example is less like "don't do the things we know will cause breakage when we can't tolerate breakage" and more like "develop runtime-gating of new features and a way of sampling or shadowing production traffic onto n+1 builds before they are eligible to become the released build".
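To make the two ideas concrete, here is a rough sketch, not any particular product's API. `in_rollout` is a hypothetical percentage-based runtime gate, and `handle_with_shadow` is a hypothetical wrapper that serves every request from the released build while replaying it against a candidate build and recording differences. The handlers are assumed to be plain callables with no side effects. In a real system the shadow call would need to be isolated from production writes.

```python
import hashlib


def in_rollout(flag_name, user_id, percent):
    """Deterministically place a user in a bucket in [0, 100) and gate on it."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = (int(digest[:8], 16) % 10000) / 100.0
    return bucket < percent


def handle_with_shadow(request, stable_handler, candidate_handler, mismatches):
    """Answer from the stable build; run the candidate on the side and compare."""
    response = stable_handler(request)
    try:
        shadow = candidate_handler(request)
        if shadow != response:
            mismatches.append((request, response, shadow))
    except Exception as exc:
        # A crash in the candidate must never affect the real response.
        mismatches.append((request, response, repr(exc)))
    return response


# Hypothetical handlers standing in for the released and n+1 builds.
def released(req):
    return req.strip().lower()


def candidate(req):
    return req.lower()  # forgets to strip whitespace


mismatches = []
served = handle_with_shadow("  Hello ", released, candidate, mismatches)
```

A candidate build would only become eligible for release once its mismatch log stays empty (or explained) over enough sampled traffic, and the gate lets a new code path be turned off at runtime without a redeploy.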

I've also had many issues with dodgy hardware of all types forever-circling repair queues in large fleets and never had a satisfying outcome for it either. Hopefully one of these days.


The whole point of putting engineers on call (in this context) is to encourage them to make good technology choices and to take some ownership for their products. If there was no counter-pressure with pager duty or other threats to peaceful existence, then most developers would just pick whatever technology they personally enjoy using the most and expect that someone else will fix their special shitpile for them at 3am. Someone is always going to get screwed in this equation, at least make it an equitable screwing.

Being on-call doesn't just apply to code either. Would you be OK if no one tried to fix your broken water pipes or electricity until the following business day? Do we turn off the global internet at bed time?

At some point people are going to have to do shitty work to keep this world running. The best you can do is rotate the shitty work around so that everyone can help out. Automate what you can, share the load for what you cannot. If everyone does their part, it is a lot less painful all around.


There's more going on and worth exploring if engineers are so unattached to outcomes they pick technology in a vacuum. Pager duty is a heavy stick - it's important not to avoid root cause analysis. So if your engineers are making bad choices - what's really going on?

That assumes engineers are even empowered to make technology choices. At many companies they are not (whether by dint of organizational structure or the roadmap not allowing a major technology shift from whatever "shitpile" you and your team have inherited).

Having clear escalation strategies (and knowing when escalation to the original engineers behind a project is even appropriate) is often lacking. I wouldn't want to call engineers in at 3am for a problem that can be fixed by following a documented devops process. Plus - what happens when the engineer you need to reach is unavailable? They are sick, or don't wake up, or their phone died?

What happens when business pressure says "we're ok with calling engineers twice a week as long as the roadmap moves"?

"You built it you're on call" is a fragile way to handle problems in more ways than one.

Which isn't to say there shouldn't be shared responsibility. Of course there should. But responsibility without power is toxic. At the very least it increases flight risk - but in practice often has a far wider reaching deleterious effect than just that.


> I wouldn't want to call engineers in at 3am for a problem that can be fixed by following a documented devops process.

Why would there be a process that could be executed that wouldn't already be automated? If the ops guy is dealing with an issue, it's because all the known remediations have failed.

Containers already have auto restart on failed health checks. VMs have vmotion and HA for failed hardware. If the ops guy is up at 3am, dealing with a service you wrote, chances are high that you (or your team) should be involved for the quickest resolution
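As an illustration of that escalation order, here is a simplified sketch of a supervisor loop. It is not how any specific orchestrator implements restarts. The `Supervisor` class, its thresholds, and the `check`, `restart`, and `page` callables are all assumptions made up for this example.

```python
class Supervisor:
    """Restart on repeated health check failures; page a human only when restarts stop helping."""

    def __init__(self, check, restart, page, failure_threshold=3, max_restarts=2):
        self.check = check
        self.restart = restart
        self.page = page
        self.failure_threshold = failure_threshold
        self.max_restarts = max_restarts
        self.failures = 0
        self.restarts = 0

    def tick(self):
        """Run one health check and return what was done about it."""
        if self.check():
            # A passing check clears both counters (a simplification:
            # a flapping service would never page under this rule).
            self.failures = 0
            self.restarts = 0
            return "healthy"
        self.failures += 1
        if self.failures < self.failure_threshold:
            return "failing"
        self.failures = 0
        if self.restarts < self.max_restarts:
            self.restarts += 1
            self.restart()
            return "restarted"
        self.page("automated restarts exhausted")
        return "paged"


# Usage with a service that never recovers.
events = []
sup = Supervisor(
    check=lambda: False,
    restart=lambda: events.append("restart"),
    page=lambda msg: events.append(f"page: {msg}"),
    failure_threshold=2,
    max_restarts=1,
)
outcomes = [sup.tick() for _ in range(4)]
```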


A documented process can be automated.

Humans should be paged only when this is a new category of failure, and in that case, having the developer wake up first triggers a really good feedback loop.
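One way to picture that rule is a dispatcher keyed on failure category. This is a sketch under assumptions: the alert shape, the `runbooks` mapping, and the `handle_alert` function are hypothetical, not taken from any paging tool.

```python
def handle_alert(alert, runbooks, page):
    """Run the automated runbook for a known category; page a human otherwise."""
    action = runbooks.get(alert["category"])
    if action is None:
        # New category of failure: nobody has automated this yet.
        page(alert)
        return "paged"
    try:
        action(alert)
        return "remediated"
    except Exception:
        # The documented fix itself failed, so a human needs to look.
        page(alert)
        return "paged"


# Hypothetical runbook table and alerts.
pages = []
cleaned = []
runbooks = {"disk_full": lambda alert: cleaned.append(alert["host"])}

known = handle_alert({"category": "disk_full", "host": "db-1"}, runbooks, pages.append)
novel = handle_alert({"category": "clock_skew", "host": "db-2"}, runbooks, pages.append)
```

Each page for a novel category is then a prompt to write the next runbook entry, which is the feedback loop described above.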


I think the model that I prefer is: "if you're going to deploy some code in the evenings or beginning of the weekend, you're on call for the code you added until work day tomorrow"

And have it be only around the code you just added.

If you deploy (and validate) earlier in the day or not on Friday (or whatever the end-of-week day is) then that requirement is gone.

I've definitely seen code get deployed at 8pm on a Friday night; and, who knows if it was guaranteed to work. That person definitely should be on call for it.


Test in prod is terrible. Write unit and integration tests, test on staging, and then monitor Prod.


Sorry to make the assumption that you didn't read the article, but it really sounds like you didn't and are just making a comment about the headline. I would recommend reading the article.


Yes, I think the headline must reflect the content of the article, otherwise it confuses me, makes me write false comments, and wastes my time.


That’s really close to what the article says. It also suggests flexing the feature in Prod, which is a totally valuable thing.

The article is mostly about the “Chuck it over the fence” mentality some developers and engineering cultures have about code that has shipped into production.


Advice like this is good for applications that place a high value on stability and quality, which aren't deployed frequently, and which are monolithic (or rather, which are deployed monolithically--microservices that are deployed all at once are less a problem as it pertains to testing).

However, the further your application deviates from these invariants, the less good this advice becomes. And there are good reasons to deviate--many applications exist in a business context in which production bugs are often not a very big deal (e.g., social networks); however, the ability to iterate rapidly (especially the ability to test new features, etc rapidly) is paramount. In these contexts, it's entirely appropriate to trade off the ability to test in advance in exchange for the ability to iterate more rapidly--this includes things like moving to a microservice deployment cadence (different components deployed at different times to elide expensive cross-team deployment coordination moments). Microservices are harder to test to the same confidence interval because it's hard to fix the versions of all of the different components in order to assure reproducible tests, and the value of doing so is diminished because in practice any given combination of versions is likely short-lived since one or more components will be redeployed in a matter of hours or even minutes. Of course, unit testing and integration testing are still valuable in these contexts, but they aren't as valuable as they are in other contexts, so testing in prod (coupled with the ability to quickly fix bugs) is necessary to make up the difference.

So like everything in software development, "it depends".


I usually try to make all my development environments the same. Make them differ only in terms of infrastructure, not code.

The more code differences you have between environments, the less valuable pre-production testing becomes.


The author apparently didn’t get the sarcasm in the "I test in prod" meme.


I think that's unfair - my reading of the article was that no matter how well you test before prod, there are emergent factors that cannot be tested for - and I agree. Test what you can pre-release, have good tools to handle failures, and be aware of the risk landscape.


The author originated the meme. It is not meant to be interpreted as sarcastic.


So much the worse.


On the contrary, it's the best strategy for reducing risk in modern web service architectures.


Is that meme sarcastic? I thought it was an expression of dismay.


Or he actually did, which makes it even worse. Either way this went wrong.


> We’re a startup. Startups don’t tend to fail because they moved too fast. They tend to fail because they obsess over trivialities that don’t actually provide business value. It was important that we reach a reasonable level of confidence, handle errors, and have multiple levels of fail-safes (i.e., backups).

Do established companies, or stopdown, fail because they moved too fast? Can we dismiss Charity's astonishingly good advice over perceived startupiness? That's dumb.


This article is inane to me. Someone decided to write a lengthy blog post on the deliberate and obvious misinterpretation of a meme.

DevOps and Observability are well described disciplines. I don't know anyone who has uttered the meme "I don't always test, but when I do, I test in production" who meant that monitoring, observing and reviewing prod behavior is bad. They invariably meant that they were rushed to release things in prod before those things were adequately verified to ensure customers don't have crappy experiences.


Another read is that it’s just a hook to get into a meaningful subject - observability can be considered part of the test process. The inflammatory title gets clicks!

The author’s product is based on folks accepting this premise and seeking tools to do better.

I do think the setup to get to the substance is a bit long and indirect - but it’s not a given, especially at series B/C startups, that folks think about production monitoring as part of the QA process, so I do like to see more pieces on that.


The opinion of the article is that the catchphrase has become an excuse to build poor observability and in-prod testing tools, and we should be better about that.


The problem is that this article is misusing the catchphrase and now people are discussing the misuse of the meme instead of the article's content.


> The opinion of the article is that the catchphrase has become an excuse to build poor observability and in-prod testing tools, and we should be better about that.

Is that a problem that actually exists though? I've certainly never encountered it, as the types of teams that are strict about good pre-release testing are also the types of teams that are big on observability.


Haha that's why I said "opinion". If you're curious about my opinion, I think there are some good points, that observability tools could be better. I think "throw it over the fence to ops" is a slow-boiling problem coming down the pike. Are observability tools abjectly poor though? I don't think so.


I didn't get that. It seems to me the point was that too many developers feel their responsibility ends when they hit deploy.


Couldn't agree more. Of course you use your product and do service quality measurements. What you don't do is toss "untested" code at your customers (possible exception here is hotfixes), which is what the meme means. Bad post written by someone either intentionally misinterpreting the meaning or dangerously unqualified to understand it.


> willfully misinterpreting

Often, when someone looks like they are “willfully misinterpreting” something, they are actually presenting uncomfortable implications or experiences of how a phrase is put into practice. Right or wrong, it is irritating to listen to.

Is it worth enduring your irritation in order to see if someone is meaningfully correct?

—————

> Don’t toss “untested” code.

“tested” is not a binary state. Code could always be better-tested; otherwise why would you have hotfixes?



