OH:
"One chart to develop, one chart to manage them, one chart to combine them all, and in the darkness bind them"
System Administration
Week 12, System Security IV: Crypto means #Cryptography
In this video, we discuss the three areas in which cryptography can provide threat mitigation -- confidentiality, integrity, and authenticity -- and some common pitfalls with each.
System Administration
Week 12, System Security III: From the Attack Life Cycle to Zero Trust
In this video, we continue approaching System Security from an attacker's point of view by understanding their common processes, following the Attack Life Cycle (with doggos!) and then identifying how our defenses, supported by the Zero Trust model, can interrupt each stage.
System Administration
Week 12, System Security II: Defining a Threat Model
In this video, we look at the concept of a Threat Model and how the attack economics may shift based on your adversaries' capabilities and motives. We introduce the STRIDE and DREAD models and draw a few circles, of course.
System Administration
Week 12, System Security I: Risk Assessment
In this video, we begin our dedicated discussion of System Security with a look back at the security-relevant aspects we've touched on in previous videos, then move on to how we can begin to assess risk rather than attempt to "secure" a system.
System Administration
Week 11, Configuration Management II
In this video, we continue our discussion of configuration management systems. We talk about state assertion, what states of a host we might care about, the CAP theorem and the fallacies of distributed systems, idempotence, eventual consistency and convergence, and the overlap of CM systems with other infrastructure components, yielding, eventually, infrastructure as a service.
The hardest part of switching back and forth between Python and Bash is remembering to put in the "fi" and "done".
System Administration
Week 11, Configuration Management I
In this video, we illustrate the general evolution of the management of system configuration and then talk about defining services by abstracting individual requirements for system-specific and service-specific aspects. We present a few sample snippets of Puppet, Chef, and CFEngine code to give you a taste of some common CM systems.
Want to grow your open source career? The #LiFTScholarship offers FREE training & certification for #DevOps, #SRE, #SysAdmins & more!
Apply by April 30: https://app.smarterselect.com/programs/102338-Linux-Foundation-Education
@ChrisLAS @ironicbadger really sad to hear about the #selfhosted #podcast reaching #EOL. I've been with you since the single-digit episodes, was an #SRE supporter and then Jupiter.party, and it was SelfHosted that brought me to #JupiterBroadcasting all those years ago.
Will be really sad to see it go; the cadence was great, and you two made wonderful hosts.
Sorry about the #AdWinter; I'm afraid that's what's doing in so many JB shows like SH and Coder Radio.
System Administration
Week 10, Backups by example
In this video, we illustrate how to perform backups using tar(1) (overcoming xkcd/1168), dump(8) and restore(8), and rsync(1), both locally and to a remote system.
On Friday, which is typically a payday for weekly-wage workers, there was some kind of outage that prevented #HomePay (from Care[dot]com) from paying out salaries to domestic workers like nannies, maids, babysitters, etc. They subsequently put a message on their website login screen, but for most of the day many had no clarity on when the funds would be disbursed to the workers. Customer care had long wait times due to this issue too. Since funds are typically collected on the Wednesday before, the money was already gone from the accounts of the families who employed them. They eventually sent out communication indicating the delayed payroll would be paid out on Monday.
I'm surprised that such a major payroll platform had a payout outage and there was no news coverage I could find on the subject. I'm really interested in understanding what the technical issues were that caused this problem. Also, what banking service is HomePay using?
this week I'm reading Human Factors in Systems Engineering
there are so many gems I've highlighted already, but I really vibed with how the author clearly and simply expressed the impact of writing docs "early" here
Are you looking for a new remote job? Browse 400+ remote positions from open source companies including @acquia @grafana @mozilla @wikimediafoundation and more on #OSJH
https://opensourcejobhub.com/jobs/?q=remote&utm_source=mosjh
#career #OpenSource #engineer #sales #security #marketing #CloudNative #developer #DevSecOps #SRE #FOSS
Want to grow your open source career? The LiFT Scholarship offers training & certs to help you level up—whether you're starting out or advancing.
Apply by April 30: https://app.smarterselect.com/programs/102338-Linux-Foundation-Education
a short lil blog post sharing how re-reading the evergreen etsy Debriefing Facilitation Guide helped me better investigate a mysterious sound....
Not sure if I asked this before: Does anyone use anything in particular to inject #apache logs into #SQL databases? I have been looking around and asking around, and the only solid answer I got was "do not expect an apache module for that; it would introduce too much latency to each request" in #httpd@libera.chat.
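For what it's worth, the usual approach I've seen is to ingest the access log out-of-band rather than per-request. A minimal sketch in Python, assuming the default "combined" LogFormat and SQLite as a stand-in for whatever SQL backend you use; the paths, table name, and regex are my placeholders, not anything httpd ships:

```python
#!/usr/bin/env python3
"""Sketch: out-of-band ingestion of an Apache "combined" access log into SQLite.
Paths, table name, and regex are assumptions; swap sqlite3 for your SQL driver."""

import re
import sqlite3

# Apache "combined" LogFormat:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
LINE = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def ingest(logfile: str, dbfile: str) -> None:
    db = sqlite3.connect(dbfile)
    db.execute(
        """CREATE TABLE IF NOT EXISTS access_log (
               host TEXT, user TEXT, time TEXT, request TEXT,
               status INTEGER, size INTEGER, referer TEXT, agent TEXT)"""
    )
    rows = []
    with open(logfile, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = LINE.match(line)
            if not m:
                continue  # skip lines that don't match the expected format
            d = m.groupdict()
            size = 0 if d["size"] == "-" else int(d["size"])
            rows.append((d["host"], d["user"], d["time"], d["request"],
                         int(d["status"]), size, d["referer"], d["agent"]))
    db.executemany("INSERT INTO access_log VALUES (?,?,?,?,?,?,?,?)", rows)
    db.commit()
    db.close()

if __name__ == "__main__":
    ingest("/var/log/apache2/access.log", "access_log.db")
```

Run from cron or a logrotate postrotate hook, this keeps the ingest entirely out of the request path, which is presumably what the folks in #httpd were getting at.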
Your logs are lying to you - metrics are meaner and better.
Everyone loves logs… until the incident postmortem reads like bad fan fiction.
Most teams start with expensive log aggregation, full-text searching their way into oblivion. So much noise. So little signal. And still, no clue what actually happened. Why? Because writing meaningful logs is a lost art.
Logs are like candles: nice for mood lighting, useless in a house fire.
If you need traces to understand your system, congratulations: you're already in hell.
Let me introduce my favourite method: real-time, metric-driven user simulation aka "Overwatch".
Here's how you do it:
Set up a service that runs real end-to-end user workflows 24/7. Use Cypress, Playwright, Selenium… your poison of choice.
Every action creates a timed metric tagged with the user workflow and action.
Now you know exactly what a user did before everything went up in flames.
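If it helps, here's a minimal sketch of that idea using Playwright's Python API and InfluxDB's HTTP line-protocol endpoint. The measurement name, URLs, token, selectors, and the "checkout" workflow are made-up placeholders; adapt to whatever browser driver and metrics backend you already run:

```python
"""Sketch: a synthetic "Overwatch"-style user workflow where every step
emits a timed metric tagged with workflow and action. All names are placeholders."""

import time
import requests
from playwright.sync_api import sync_playwright

INFLUX_WRITE = "http://influxdb.example.internal:8086/api/v2/write"
INFLUX_PARAMS = {"org": "ops", "bucket": "overwatch", "precision": "ns"}
INFLUX_TOKEN = "REDACTED"  # assumed InfluxDB v2 API token

def emit(workflow: str, action: str, duration_ms: float, ok: bool) -> None:
    """Write one point in line protocol, tagged with workflow and action."""
    point = (
        f"user_workflow,workflow={workflow},action={action},ok={int(ok)} "
        f"duration_ms={duration_ms:.1f} {time.time_ns()}"
    )
    requests.post(
        INFLUX_WRITE,
        params=INFLUX_PARAMS,
        headers={"Authorization": f"Token {INFLUX_TOKEN}"},
        data=point,
        timeout=5,
    )

def step(workflow: str, action: str, fn) -> None:
    """Time a single user action and emit the metric, pass or fail."""
    start = time.monotonic()
    ok = True
    try:
        fn()
    except Exception:
        ok = False
        raise
    finally:
        emit(workflow, action, (time.monotonic() - start) * 1000, ok)

def checkout_workflow() -> None:
    """One end-to-end user journey, run on a loop or scheduler 24/7."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        step("checkout", "open_shop", lambda: page.goto("https://shop.example.com"))
        step("checkout", "login", lambda: page.fill("#user", "overwatch-bot"))
        step("checkout", "add_to_cart", lambda: page.click("#add-to-cart"))
        step("checkout", "pay", lambda: page.click("#pay"))
        browser.close()

if __name__ == "__main__":
    while True:
        checkout_workflow()
        time.sleep(60)  # one run per workflow per minute
```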
Use Grafana + InfluxDB (or other tools you already use) to build dashboards that actually tell stories:
* How fast are user workflows?
* Which steps are breaking, and how often?
* What's slower today than yesterday?
* Who's affected, and where?
Alerts now mean something.
Incidents become surgical strikes, not scavenger hunts.
Bonus: run the same system on every test environment and detect regressions before deployment. And if you made it reusable, you can even run the service to do load tests.
No need to buy overpriced tools. Just build a small service like you already do, except this one might save your soul.
And yes, transform logs into metrics where possible. Just hash your PII data and move on.
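For the PII bit, something as simple as a salted hash before the value ever becomes a tag goes a long way. A sketch; salt handling and truncation length are up to you:

```python
import hashlib

SALT = b"rotate-me"  # placeholder; manage like any other secret

def pseudonymize(value: str) -> str:
    """Turn a PII value (user id, email, ...) into a stable, non-reversible tag."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

# e.g. tag the metric with pseudonymize("alice@example.com") instead of the address
```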
Stop guessing. Start observing.
Metrics > Logs. Always.