Oliver Bennet for GraphPe

10 Essential Linux Commands Every Site Reliability Engineer (SRE) Should Know

Introduction

In the role of a Site Reliability Engineer (SRE), knowing the right Linux commands is key to maintaining, monitoring, and troubleshooting complex systems. These commands can make everyday tasks smoother and ensure reliable system performance. Here’s a look at ten must-know commands that every SRE should keep handy.

1. uptime - System Uptime and Load

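Description: Shows how long the system has been running, how many users are logged in, and the load averages over the past 1, 5, and 15 minutes.

Why It’s Important: A quick way to gauge system load and spot a system under stress, for example when the load average stays above the number of CPU cores.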

Basic Usage: uptime
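A load average means more when it is compared with the number of CPU cores. The sketch below assumes a Linux host where /proc/loadavg exists and nproc is installed; load_per_core is a hypothetical helper name, not part of uptime.

```shell
# load_per_core is a hypothetical helper: divide a load average by a core count.
load_per_core() {
    # $1 = load average, $2 = number of CPU cores
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Assumption: on Linux, the first field of /proc/loadavg is the same
# 1-minute figure that uptime prints.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one_min rest < /proc/loadavg
    echo "1-minute load per core: $(load_per_core "$one_min" "$(nproc)")"
fi
```

A value that stays well above 1.00 generally means more tasks are waiting than there are cores to run them.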

2. journalctl - System Logs Access

Description: Displays logs collected by systemd.

Why It’s Important: Essential for troubleshooting errors or investigating performance issues, particularly on systemd-based Linux systems.

Basic Usage: journalctl -u nginx.service (Show logs for a specific service)
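During an incident it often helps to narrow the logs by severity and time. The following is a minimal sketch; recent_errors is a hypothetical helper name and nginx.service is only an example unit, while the journalctl flags themselves are standard.

```shell
# recent_errors is a hypothetical helper name; the journalctl flags are standard.
recent_errors() {
    # $1 = systemd unit, $2 = optional start time (defaults to one hour ago)
    # -p err keeps messages of priority "err" and more severe;
    # --no-pager writes straight to stdout so the output can be piped.
    journalctl -u "$1" -p err --since "${2:-1 hour ago}" --no-pager
}

# Example: recent_errors nginx.service "2 hours ago"
```

To watch new entries as they arrive instead, journalctl -u nginx.service -f follows the log much like tail -f.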

3. free - Memory Usage

Description: Displays memory usage including free, used, and cached memory.

Why It’s Important: Crucial for detecting memory leaks or planning memory upgrades.

Basic Usage: free -h (Displays in human-readable format)
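For scripts and alerts, a single percentage is easier to work with than the full table. This sketch assumes a version of free that prints an "available" column (as recent procps-ng releases do); mem_available_pct is a hypothetical helper name.

```shell
# mem_available_pct is a hypothetical helper that reads `free` output on stdin.
mem_available_pct() {
    awk '
        # Header row: find the "available" column. The data rows carry an
        # extra leading label ("Mem:"), so the data field index is i + 1.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "available") col = i + 1 }
        # Mem row: $2 is total memory.
        /^Mem:/ && col { printf "%.0f\n", $col / $2 * 100 }
    '
}

if command -v free >/dev/null 2>&1; then
    # LC_ALL=C keeps the column names in English; -b reports plain bytes.
    LC_ALL=C free -b | mem_available_pct
fi
```

The "available" figure is usually a better guide than "free", because memory used for cache can be reclaimed when applications need it.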

4. iostat - Input/Output Statistics

Description: Reports CPU and I/O statistics for devices and partitions.

Why It’s Important: Helps to identify I/O bottlenecks and assess disk performance.

Basic Usage: iostat -xz 1 (Shows extended statistics every second, omitting devices with no activity)
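One way to spot a saturated disk is to filter the extended output on the %util column. The sketch below is an illustration only: busy_devices is a hypothetical helper, the 80 percent threshold is an arbitrary assumption, and the -y flag assumes a sysstat version that supports it.

```shell
# busy_devices is a hypothetical helper that reads `iostat -x` output on stdin
# and prints devices whose %util is at or above a threshold (default 80).
busy_devices() {
    awk -v limit="${1:-80}" '
        # Header row: remember which column holds %util.
        $1 ~ /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
        # Device rows: compare numerically and print name plus utilisation.
        col && NF >= col && $col + 0 >= limit + 0 { print $1, $col }
    '
}

# Example (assumes sysstat is installed): -d limits output to devices,
# -y skips the first report, which covers the time since boot.
# LC_ALL=C iostat -dxzy 1 3 | busy_devices 80
```

Looking up the column by its header name, rather than by position, keeps the filter working across sysstat versions that order the columns differently.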

5. lsof - List Open Files

Description: Lists open files and the processes that opened them.

Why It’s Important: Useful for identifying resource usage and finding potential file descriptor leaks.

Basic Usage: lsof -i :80 (Lists processes using port 80)
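When something is already bound to a port, the process IDs alone are often enough to act on. A minimal sketch, where pids_on_port is a hypothetical helper name and the lsof flags are standard:

```shell
# pids_on_port is a hypothetical helper name; the lsof flags are standard.
pids_on_port() {
    # -t prints bare PIDs, -n and -P skip host and port name lookups,
    # -iTCP:<port> with -sTCP:LISTEN keeps only listening TCP sockets.
    lsof -t -n -P -iTCP:"$1" -sTCP:LISTEN
}

# Example: show the command line of whatever listens on port 80
# (seeing other users' processes usually needs root).
# for pid in $(pids_on_port 80); do ps -o pid= -o args= -p "$pid"; done
```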

6. dstat - System Resource Statistics

Description: Combines multiple monitoring tools in one command to display CPU, disk, network, memory, and process stats.

Why It’s Important: Provides a real-time overview of system health, combining various metrics in one view.

Basic Usage: dstat -cdnm (Displays CPU, disk, network, and memory usage)

7. curl - Test HTTP Services

Description: Transfers data from or to a server, often used for API and HTTP testing.

Why It’s Important: Allows you to troubleshoot and test endpoints and web services quickly.

Basic Usage: curl -I http://example.com (Sends a HEAD request and prints only the response headers)
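For health checks, the status code is often all that matters. The sketch below is one possible approach; http_status is a hypothetical helper name and the 5-second timeout is an assumed default, while the curl flags are standard.

```shell
# http_status is a hypothetical helper name; the curl flags are standard.
http_status() {
    # -s silences progress output, -o /dev/null discards the body,
    # -w prints only the numeric status code, --max-time bounds the wait.
    curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$1"
}

# Example: http_status http://example.com
```

Unlike curl -I, this sends a normal GET request, which is useful for endpoints that do not answer HEAD requests the same way.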

8. ping and traceroute - Network Connectivity and Path Analysis

Description: ping checks connectivity to a remote host, while traceroute shows the path packets take.

Why It’s Important: Essential for diagnosing network issues and pinpointing connection problems.

Basic Usage: ping google.com / traceroute google.com
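The two commands work well as a pair: ping first, and trace the route only when no reply comes back. A minimal sketch, assuming the Linux versions of both tools; check_host is a hypothetical helper name.

```shell
# check_host is a hypothetical helper name.
check_host() {
    # -c 3 stops after three echo requests; without it, ping on Linux
    # keeps running until interrupted.
    if ping -c 3 "$1" >/dev/null 2>&1; then
        echo "$1: replies received"
    else
        echo "$1: no ping replies, tracing the route" >&2
        # -n skips reverse DNS lookups so the hops print faster.
        traceroute -n "$1"
    fi
}

# Example: check_host google.com
```

Keep in mind that some hosts and firewalls drop ICMP, so a missing reply does not always mean the host is down.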

9. sar - System Activity Report

Description: Collects and displays CPU, memory, network, and disk statistics.

Why It’s Important: Helps identify patterns and potential issues over time, especially useful for trend analysis.

Basic Usage: sar -u 1 5 (Reports CPU usage once per second, five times)
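The summary row that sar prints at the end can be reduced to a single busy percentage. This sketch assumes the default sar -u layout, where %idle is the last column of the "Average:" row; avg_cpu_busy is a hypothetical helper name.

```shell
# avg_cpu_busy is a hypothetical helper that reads `sar -u` output on stdin.
avg_cpu_busy() {
    # Assumption: the last column of the "Average:" row is %idle,
    # so busy time is 100 minus that value.
    awk '/^Average:/ { printf "%.1f\n", 100 - $NF }'
}

# Example (assumes sysstat is installed; LC_ALL=C keeps the "Average:"
# label and the decimal point in the C locale):
# LC_ALL=C sar -u 1 5 | avg_cpu_busy
```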

10. systemctl - Manage Services

Description: Controls system services, allowing you to start, stop, enable, and check service status.

Why It’s Important: Essential for service reliability, as it enables quick service restarts or status checks during incidents.

Basic Usage: systemctl restart nginx (Restart the Nginx service)
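A restart is only half the job; it is worth confirming that the service actually came back. The following is a minimal sketch, where restart_and_verify is a hypothetical helper name and the restart itself typically needs root privileges.

```shell
# restart_and_verify is a hypothetical helper name.
restart_and_verify() {
    # $1 = systemd unit to restart
    systemctl restart "$1" || return 1
    # is-active --quiet prints nothing and reports the state via exit code.
    if systemctl is-active --quiet "$1"; then
        echo "$1 is active"
    else
        echo "$1 failed to start" >&2
        # Show the unit status, including its most recent log lines.
        systemctl --no-pager status "$1"
        return 1
    fi
}

# Example: restart_and_verify nginx
```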

Conclusion

These commands are invaluable tools for SREs to monitor, manage, and troubleshoot systems effectively. Mastering them can enhance both the reliability and resilience of your infrastructure, ultimately leading to a smoother and more efficient production environment.

If you would like to learn more about the Bash CLI and master it yourself, here is a video.
