#sre

System Administration

Week 12, System Security III: From the Attack Life Cycle to Zero Trust

In this video, we continue approaching System Security from an attacker's point of view by understanding their common processes, following the Attack Life Cycle (with doggos!) and then identifying how our defenses, supported by the Zero Trust model, can interrupt each stage.

youtu.be/mNxHw5XzxJw

System Administration

Week 12, System Security II: Defining a Threat Model

In this video, we look at the concept of a Threat Model and how the attack economics may shift based on your adversaries' capabilities and motives. We introduce the STRIDE and DREAD models and draw a few circles, of course.

youtu.be/nZQboq3gjgg

System Administration

Week 12, System Security I: Risk Assessment

In this video, we begin our dedicated discussion of System Security with a look back at how we've talked about security-relevant aspects in previous videos, and then move forward to defining how we can begin to assess risk rather than attempt to "secure" a system.

youtu.be/KZi9ZWF6vWI

System Administration

Week 11, Configuration Management II

In this video, we continue our discussion of configuration management systems. We talk about state assertion, what states of a host we might care about, the CAP theorem and other fallacies of distributed systems, idempotence, eventual consistency and convergence, and the overlap of CM systems with other infrastructure components, yielding, eventually, infrastructure as a service.

youtu.be/FJSpmBPv1J4

System Administration

Week 11, Configuration Management I

In this video, we illustrate the general evolution of the management of system configuration and then talk about defining services by abstracting individual requirements for system-specific and service-specific aspects. We present a few sample snippets of Puppet, Chef, and CFEngine code to give you a taste of some common CM systems.

youtu.be/pY0mCH7tpR0

@ChrisLAS @ironicbadger Really sad to hear about the #selfhosted #podcast reaching #EOL. I've been with you since the single-digit episodes and was an #SRE supporter, then Jupiter.party; it was SelfHosted that brought me to #JupiterBroadcasting all those years ago.

Will be really sad to see it go, the cadence was great, and you two made wonderful hosts.

Sorry about the #AdWinter, afraid that is what is doing in so many JB shows like SH and Coder Radio.

On Friday, which is typically payday for weekly-wage workers, some kind of outage prevented #HomePay (from Care[dot]com) from paying out salaries to domestic workers such as nannies, maids, and babysitters. They subsequently posted a message on their website login screen, but for most of the day many people had no clarity on when the funds would be disbursed to the workers. Customer care had long wait times because of the issue, too. Since funds are typically collected on the Wednesday before, the money was already gone from the accounts of the families who employed them. They eventually sent out a communication indicating that the delayed payroll would be paid out on Monday.

I’m surprised that such a major payroll platform had a payout outage and there was no news coverage I could find on the subject. I’m really interested in understanding what technical issues caused this problem. Also, what banking service is HomePay using?

Your logs are lying to you - metrics are meaner and better.

Everyone loves logs… until the incident postmortem reads like bad fan fiction.
Most teams start with expensive log aggregation, full-text searching their way into oblivion. So much noise. So little signal. And still, no clue what actually happened. Why? Because writing meaningful logs is a lost art.
Logs are like candles, nice for mood lighting, useless in a house fire.

If you need traces to understand your system, congratulations: you're already in hell.

Let me introduce my favourite method: real-time, metric-driven user simulation aka "Overwatch".

Here's how you do it:

🧪 Set up a service that runs real end-to-end user workflows 24/7. Use Cypress, Playwright, Selenium… your poison of choice.
📊 Every action creates a timed metric tagged with the user workflow and action.
🧠 Now you know exactly what a user did before everything went up in flames.
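
A minimal sketch of the "timed metric per action" idea, assuming Python and an InfluxDB-style line protocol string as the output format. The names `timed_action`, `to_line_protocol`, `recorded_lines` and the measurement name `user_action` are made up for illustration, and the `time.sleep` call stands in for a real Playwright, Cypress, or Selenium step:

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory sink; a real service would ship these lines
# to whatever metrics store it already uses.
recorded_lines = []


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as an InfluxDB-style line (no escaping of special characters)."""
    tag_part = ",".join(f"{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(f"{key}={value}" for key, value in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


@contextmanager
def timed_action(workflow, action, sink=None):
    """Time one user action and record its duration and outcome,
    tagged with the workflow and action names."""
    target = recorded_lines if sink is None else sink
    start = time.perf_counter()
    succeeded = True
    try:
        yield
    except Exception:
        succeeded = False
        raise  # the failure still propagates; we only record it
    finally:
        duration_ms = (time.perf_counter() - start) * 1000.0
        target.append(to_line_protocol(
            "user_action",
            {"workflow": workflow, "action": action},
            {"duration_ms": round(duration_ms, 3),
             "success": "true" if succeeded else "false"},
            time.time_ns(),
        ))


# Usage: wrap each step of a simulated user workflow.
with timed_action("checkout", "add_to_cart"):
    time.sleep(0.01)  # stand-in for a real browser automation step
```

Because the metric is written in the `finally` branch, a failing step still leaves a data point behind, which is what lets you see the last thing the simulated user did.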

Use Grafana + InfluxDB (or other tools you already use) to build dashboards that actually tell stories:

* How fast are user workflows?
* Which steps are breaking, and how often?
* What's slower today than yesterday?
* Who's affected, and where?
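
As a rough illustration of the "slower today than yesterday" question, here is a small Python sketch that compares median durations per workflow step. The function name `slower_steps`, the 1.2 threshold, and the sample numbers are assumptions for the example; in practice the dashboard tool would usually run this comparison as a query:

```python
from statistics import median


def slower_steps(yesterday, today, threshold=1.2):
    """Return steps whose median duration today is at least `threshold`
    times yesterday's median. Keys are (workflow, action) pairs and
    values are lists of durations in milliseconds."""
    regressions = {}
    for step, today_samples in today.items():
        baseline = yesterday.get(step)
        if not baseline or not today_samples:
            continue  # nothing to compare against
        before, after = median(baseline), median(today_samples)
        if before > 0 and after / before >= threshold:
            regressions[step] = (before, after)
    return regressions


# Made-up sample data for two steps.
yesterday = {("checkout", "pay"): [100, 110, 120], ("login", "submit"): [50, 52, 54]}
today = {("checkout", "pay"): [150, 160, 170], ("login", "submit"): [51, 52, 53]}
result = slower_steps(yesterday, today)
```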

🎯 Alerts now mean something.
🚨 Incidents become surgical strikes, not scavenger hunts.
⚙️ Bonus: run the same system on every test environment and detect regressions before deployment. And if you made it reusable, you can even run the service to do load tests.

No need to buy overpriced tools. Just build a small service like you already do, except this one might save your soul.

And yes, transform logs into metrics where possible. Just hash your PII data and move on.
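
One way this could look, sketched in Python. The log format, the function names, and the sample key are all hypothetical. The sketch uses a keyed hash (HMAC-SHA256) rather than a plain hash, which is my substitution: an unkeyed hash of an email address is easy to reverse by guessing.

```python
import hashlib
import hmac
import re
from collections import Counter

# Hypothetical log format: "<level> user=<email> path=<path> status=<code>"
LOG_PATTERN = re.compile(
    r"user=(?P<user>\S+) path=(?P<path>\S+) status=(?P<status>\d{3})"
)


def pseudonymize(value, key):
    """Replace a PII value with a truncated keyed hash (HMAC-SHA256)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def logs_to_metrics(lines, key):
    """Turn raw log lines into two metrics: request counts per
    (path, status) and the set of pseudonymized users seen per path."""
    request_counts = Counter()
    users_per_path = {}
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # lines that don't fit the format are skipped
        request_counts[(match["path"], match["status"])] += 1
        users_per_path.setdefault(match["path"], set()).add(
            pseudonymize(match["user"], key)
        )
    return request_counts, users_per_path


sample_logs = [
    "INFO user=a@example.com path=/cart status=200",
    "ERROR user=a@example.com path=/cart status=500",
    "INFO user=b@example.com path=/cart status=200",
    "DEBUG cache warmed",
]
counts, users = logs_to_metrics(sample_logs, key=b"example-secret")
```

The counts answer "how often does this path fail", and the hashed user sets answer "how many distinct users were affected" without storing the addresses themselves.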

Stop guessing. Start observing.
Metrics > Logs. Always.