CentOS performance monitoring (viewing GPU information on CentOS)

This article covers performance monitoring on CentOS, including how to view GPU information. It is fairly long, but reading it through should help you solve the monitoring problems you run into.

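The PCP suite (a toolset for monitoring and performance analysis)

PCP (Performance Co-Pilot) is a toolset for performance monitoring and analysis. It monitors system metrics such as CPU utilization, memory usage, and disk I/O, and it ships analysis tools that help locate performance bottlenecks and tune the system.

Installing PCP

PCP can be installed on Linux systems. On CentOS, install it with: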

```
yum install pcp pcp-webapi pcp-manager
```

On Ubuntu, install it with:

```
sudo apt-get install pcp pcp-webapi pcp-manager
```

Once installed, start the PCP collector service with:

```
systemctl start pmcd
```

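Using PCP

PCP provides several tools for monitoring and analyzing system performance. The most commonly used ones are described below.

1. pmstat

pmstat reports system performance in real time. It shows metrics such as CPU utilization, memory usage, and disk I/O. Start it with:

```
pmstat
```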

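2. pminfo

pminfo queries information about performance metrics. CPU time is exposed under the kernel.all.cpu subtree; the following command fetches the current values of those metrics:

```
pminfo -f kernel.all.cpu
```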

3. pmchart

pmchart plots performance metrics as charts (on CentOS it is provided by the pcp-gui package). The following command opens the predefined CPU view and samples once per second:

```
pmchart -c CPU -t 1sec
```

4. pmdumptext

pmdumptext exports metric values as text. The following command prints the CPU time metrics once every 60 seconds:

```
pmdumptext -t 60sec kernel.all.cpu.user kernel.all.cpu.sys kernel.all.cpu.idle
```
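PCP reports CPU time as cumulative counters rather than as a ready-made percentage, so a utilization figure has to be derived from two samples. The sketch below assumes the kernel.all.cpu.idle metric (a millisecond counter summed over all CPUs) and hinv.ncpu, and assumes that pminfo -f prints a "value N" line for such singular metrics. The function names cpu_util_pct, pcp_value and pcp_cpu_util are illustrative only.

```shell
#!/bin/sh
# cpu_util_pct IDLE_BEFORE_MS IDLE_AFTER_MS INTERVAL_MS NCPU
# Prints the busy percentage over the interval: 100 * (1 - idle share).
cpu_util_pct() {
    awk -v before="$1" -v after="$2" -v interval="$3" -v ncpu="$4" 'BEGIN {
        idle_share = (after - before) / (interval * ncpu)
        printf "%.1f\n", 100 * (1 - idle_share)
    }'
}

# Reads one singular metric (assumed output line: "    value N").
pcp_value() {
    pminfo -f "$1" | awk '$1 == "value" { print $2; exit }'
}

# pcp_cpu_util [SECONDS]
# Samples the idle counter twice, SECONDS apart (default 5).
# Needs PCP installed and pmcd running; the result is approximate
# because sleep is not an exact timer.
pcp_cpu_util() {
    secs=${1:-5}
    ncpu=$(pcp_value hinv.ncpu)
    idle1=$(pcp_value kernel.all.cpu.idle)
    sleep "$secs"
    idle2=$(pcp_value kernel.all.cpu.idle)
    cpu_util_pct "$idle1" "$idle2" "$((secs * 1000))" "$ncpu"
}
```

On a host with pmcd running, `pcp_cpu_util 10` would print the average utilization over roughly ten seconds. The arithmetic is kept in its own function so it can be checked without a PCP installation.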

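Advantages of PCP

PCP has the following advantages:

1. High precision: PCP monitors a wide range of system metrics with high precision, which helps pinpoint performance bottlenecks.

2. Ease of use: PCP provides several tools for monitoring and analysis, and they are simple to pick up.

3. Scalability: PCP can scale to thousands of hosts and millions of metrics, so it fits monitoring setups of any size.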

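Monitoring disks with smartctl on Linux

smartctl is the command-line tool for S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) tasks on Unix-like systems. It prints the SMART self-test and error logs, enables and disables SMART automatic testing, and starts device self-tests.

smartctl is especially useful on physical Linux servers, where it can check SMART-capable disks for errors and extract information about disks attached to a hardware RAID controller.

This post walks through some practical smartctl examples. If smartctl is not yet installed on your Linux system, install it as follows.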

Installing smartctl

On Ubuntu

$ sudo apt-get install smartmontools

On CentOS & RHEL

# yum install smartmontools

Starting the smartd service

On Ubuntu

$ sudo /etc/init.d/smartmontools start

On CentOS & RHEL

# service smartd start; chkconfig smartd on

Examples

Example 1: Check whether SMART is enabled on a disk

root@linuxtechi:~# smartctl -i /dev/sdb

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-32-generic] (local build)

Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===

Model Family: Seagate Momentus 5400.6

Device Model: ST9320325AS

Serial Number: 5VD2V59T

LU WWN Device Id: 5 000c50 020a37ec4

Firmware Version: 0002BSM1

User Capacity: 320,072,933,376 bytes [320 GB]

Sector Size: 512 bytes logical/physical

Rotation Rate: 5400 rpm

Device is: In smartctl database [for details use: -P show]

ATA Version is: ATA8-ACS T13/1699-D revision 4

SATA Version is: SATA 2.6, 1.5 Gb/s

Local Time is: Sun Nov 16 12:32:09 2014 IST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

Here /dev/sdb is your hard disk. The last two lines of the output show that SMART support is available and enabled.

Example 2: Enable SMART on a disk

root@linuxtechi:~# smartctl -s on /dev/sdb

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-32-generic] (local build)

Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF ENABLE/DISABLE COMMANDS SECTION ===

SMART Enabled.

Example 3: Disable SMART on a disk

root@linuxtechi:~# smartctl -s off /dev/sdb

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-32-generic] (local build)

Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF ENABLE/DISABLE COMMANDS SECTION ===

SMART Disabled. Use option -s with argument 'on' to enable it.

Example 4: Show detailed SMART information for a disk

root@linuxtechi:~# smartctl -a /dev/sdb    # For IDE drive

root@linuxtechi:~# smartctl -a -d ata /dev/sdb    # For SATA drive

Example 5: Show the overall health of a disk

root@linuxtechi:~# smartctl -H /dev/sdb

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-32-generic] (local build)

Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

Warning: This result is based on an Attribute check.

Please note the following marginal Attributes:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

190 Airflow_Temperature_Cel 0x0022 067 045 045 Old_age Always In_the_past 33 (Min/Max 25/33)
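For scripted checks, the verdict can be pulled out of the smartctl -H output. A minimal sketch, assuming the ATA-style line "SMART overall-health self-assessment test result: ..." shown in the output above; the function name smart_health is illustrative.

```shell
#!/bin/sh
# Reads `smartctl -H` output on stdin and prints the verdict (e.g. PASSED),
# or UNKNOWN when no overall-health line is present.
smart_health() {
    awk -F': ' '
        /overall-health self-assessment test result/ { print $2; found = 1; exit }
        END { if (!found) print "UNKNOWN" }'
}

# Usage (needs root and smartmontools):
#   smartctl -H /dev/sdb | smart_health
```

Printing UNKNOWN instead of nothing keeps a cron job or monitoring check from silently treating missing output as healthy.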

Example 6: Test a disk with the long and short options

Long test

root@linuxtechi:~# smartctl --test=long /dev/sdb

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-32-generic] (local build)

Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===

Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".

Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.

Testing has begun.

Please wait 102 minutes for test to complete.

Test will complete after Sun Nov 16 14:29:43 2014

Use smartctl -X to abort test.

Alternatively, the command output can be redirected to a log file, like this:

root@linuxtechi:~# smartctl --test=long /dev/sdb > /var/log/long.text

Short test

root@linuxtechi:~# smartctl --test=short /dev/sdb

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-32-generic] (local build)

Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===

Sending command: "Execute SMART Short self-test routine immediately in off-line mode".

Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.

Testing has begun.

Please wait 1 minutes for test to complete.

Test will complete after Sun Nov 16 12:51:45 2014

Use smartctl -X to abort test.

root@linuxtechi:~# smartctl --test=short /dev/sdb > /var/log/short.text

Note: the short test takes at most 2 minutes, whereas the long test has no fixed time limit because it reads and verifies every sector of the disk.

Example 7: View the self-test results of a drive

root@linuxtechi:~# smartctl -l selftest /dev/sdb

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-32-generic] (local build)

Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===

SMART Self-test log structure revision number 1

Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error

# 1 Short offline Completed: read failure 90% 492 210841222

# 2 Extended offline Completed: read failure 90% 492 210841222
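The self-test log is easy to check from a script. The sketch below counts log entries whose status contains the word "failure", as both entries above do. Matching on that word is an assumption: it covers statuses such as "Completed: read failure" but not necessarily every error status a drive can report. The name failed_selftests is illustrative.

```shell
#!/bin/sh
# Reads `smartctl -l selftest` output on stdin and prints how many
# logged self-tests report a failure status.
failed_selftests() {
    awk '/^# *[0-9]+ / && /failure/ { n++ } END { print n + 0 }'
}

# Usage (needs root and smartmontools):
#   smartctl -l selftest /dev/sdb | failed_selftests
```

A non-zero count is a reasonable trigger for looking at the LBA_of_first_error column and planning a disk replacement.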

Example 8: Estimate how long the tests will take

root@linuxtechi:~# smartctl -c /dev/sdb

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-32-generic] (local build)

Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===

General SMART Values:

Offline data collection status: (0x00) Offline data collection activity

was never started.

Auto Offline Data Collection: Disabled.

Self-test execution status: ( 121) The previous self-test completed having

the read element of the test failed.

Total time to complete Offline

data collection: ( 0) seconds.

Offline data collection

capabilities: (0x73) SMART execute Offline immediate.

Auto Offline data collection on/off support.

Suspend Offline collection upon new

command.

No Offline surface scan supported.

Self-test supported.

Conveyance Self-test supported.

Selective Self-test supported.

SMART capabilities: (0x0003) Saves SMART data before entering

power-saving mode.

Supports SMART auto save timer.

Error logging capability: (0x01) Error logging supported.

General Purpose Logging supported.

Short self-test routine

recommended polling time: ( 1) minutes.

Extended self-test routine

recommended polling time: ( 102) minutes.

Conveyance self-test routine

recommended polling time: ( 2) minutes.

SCT capabilities: (0x103b) SCT Status supported.

SCT Error Recovery Control supported.

SCT Feature Control supported.

SCT Data Table supported.
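The recommended polling times in this output line up with the wait times printed in Example 6 (1 minute for the short test, 102 minutes for the long one), so a script can read them to decide how long to wait before fetching the self-test log. A minimal sketch, assuming the two-line layout shown above (a "... self-test routine" line followed by a "recommended polling time" line); selftest_minutes is an illustrative name.

```shell
#!/bin/sh
# selftest_minutes KIND   (KIND is Short, Extended or Conveyance)
# Reads `smartctl -c` output on stdin and prints the recommended polling
# time in minutes for that self-test routine.
selftest_minutes() {
    awk -v kind="$1" '
        $0 ~ ("^" kind " self-test routine") { want = 1; next }
        want && /recommended polling time/ {
            gsub(/[^0-9]/, "")      # keep only the digits of "( 102) minutes."
            print
            exit
        }'
}

# Usage (needs root and smartmontools):
#   smartctl -c /dev/sdb | selftest_minutes Extended
```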

Example 9: Show the disk error log

root@linuxtechi:~# smartctl -l error /dev/sdb

Sample Output

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-32-generic] (local build)

Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===

SMART Error Log Version: 1

ATA Error Count: 5

CR = Command Register [HEX]

FR = Features Register [HEX]

SC = Sector Count Register [HEX]

SN = Sector Number Register [HEX]

CL = Cylinder Low Register [HEX]

CH = Cylinder High Register [HEX]

DH = Device/Head Register [HEX]

DC = Device Command Register [HEX]

ER = Error register [HEX]

ST = Status register [HEX]

Powered_Up_Time is measured from power on, and printed as

DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,

SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Commands leading to the command that caused the error were:

CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name

----------------------------------------------------

25 da 08 e7 e5 a5 4c 00 00:30:44.515 READ DMA EXT

25 da 08 df e5 a5 4c 00 00:30:44.514 READ DMA EXT

25 da 80 5f e5 a5 4c 00 00:30:44.502 READ DMA EXT

25 da f0 5f e6 a5 4c 00 00:30:44.496 READ DMA EXT

25 da 10 4f e6 a5 4c 00 00:30:44.383 READ DMA EXT

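Setting up Prometheus GPU monitoring on CentOS 7

The recommended steps for installing nvidia-container-runtime are as follows:

First, install nvidia-container-toolkit to support GPU containers. Set up its repository and GPG key, add the experimental branch to the repository list, refresh the package list, and install the nvidia-container-toolkit package. Then configure the Docker daemon to recognize the NVIDIA container runtime, set it as the default runtime, and restart the Docker daemon to complete the installation.

Installing nvidia-container-runtime is also recommended: configure its repository, install it, and restart Docker.

Install the NVIDIA monitoring component, then run curl localhost:9400/metrics on the host to confirm that GPU metrics are being served.

Install node_exporter-1.5.0.linux-amd64 and create a service for it. Start the service so that metrics from the GPU host are exposed alongside the NVIDIA metrics.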
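The curl check on port 9400 returns every metric in the Prometheus text exposition format, which is hard to scan by eye. The sketch below filters a single metric out of that output. It assumes the component listening on port 9400 is NVIDIA's dcgm-exporter (9400 is its default port) and uses its DCGM_FI_DEV_GPU_UTIL metric as the example; the text above does not name the exporter, so treat both as assumptions. metric_samples is an illustrative name.

```shell
#!/bin/sh
# metric_samples NAME
# Reads Prometheus text exposition format on stdin and prints the sample
# lines for exactly that metric name, skipping # HELP / # TYPE comments.
metric_samples() {
    awk -v name="$1" '
        /^#/ { next }
        index($0, name "{") == 1 || index($0, name " ") == 1 { print }'
}

# Usage on the GPU host:
#   curl -s localhost:9400/metrics | metric_samples DCGM_FI_DEV_GPU_UTIL
```

Matching on the name followed by "{" or a space avoids picking up other metrics that merely share the same prefix.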

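Configuration on the monitoring host includes:

Configure prometheus.yml to define the scrape settings and targets. By default Prometheus retains data for 15 days; this can be changed as needed.

Install Prometheus with Docker for continuous monitoring of system state and performance.

Install Grafana as the visualization tool to chart the Prometheus data for analysis and diagnosis.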
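A prometheus.yml for the setup described above could be generated as follows. This is a sketch under assumptions: gpu-host is a placeholder for the GPU server's address, the job names are arbitrary, port 9400 is the GPU metrics endpoint checked with curl earlier, and 9100 is node_exporter's default port.

```shell
#!/bin/sh
# write_prometheus_config [OUTPUT_PATH] [GPU_HOST]
# Writes a minimal Prometheus config that scrapes the GPU metrics endpoint
# and node_exporter on one host. Host and job names are placeholders.
write_prometheus_config() {
    out=${1:-prometheus.yml}
    host=${2:-gpu-host}
    cat > "$out" <<EOF
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: gpu
    static_configs:
      - targets: ['${host}:9400']
  - job_name: node
    static_configs:
      - targets: ['${host}:9100']
EOF
}

# Usage:
#   write_prometheus_config ./prometheus.yml 192.0.2.10
```

The 15-day retention is not set in this file; it is controlled by the --storage.tsdb.retention.time flag on the prometheus command line.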
