System Hung after message INFO: task oracle:2915 blocked for more than 120 seconds.

Error:
System Hung after message INFO: task oracle:2915 blocked for more than 120 seconds.

Cause:
These messages means that the system is experiencing resource congestion and processes are being starved of available resources. But what cause the
real problem lets check out.

Troubleshooting:
1. Check the hung task configuration parameters.

[root@Server_issue ~]# sysctl -a | grep -i hung
kernel.hung_task_warnings = 0
kernel.hung_task_timeout_secs = 120 -- Number of seconds it waits before declaring a problem, if it is 0 then the hungtask warning is disabled.
kernel.hung_task_check_count = 4194304
kernel.hung_task_panic = 0 -- system do not panic on a hung task. if set to 1 then system do panic.
[root@Server_issue ~]#

2. Check if hungtaskd process is running.

[root@Server_issue ~]# ps aux | grep -i hungt
root 575 0.0 0.0 0 0 ? S Jun13 0:00 [khungtaskd]
root 5665 0.0 0.0 61216 820 pts/2 R+ 11:14 0:00 grep -i hungt
[root@Server_issue ~]#

3. These are alerts found in dmesg.
INFO: task oracle:2915 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
oracle D ffff81000103f1a0 0 2915 1 2917 2913 (NOTLB)
ffff8104080779d8 0000000000000086 0000000000000003 0000000000000001
ffff81083816d100 000000000000000a ffff81033b6fb040 ffff81083fc83100
0000490012f22173 00000000000ea740 ffff81033b6fb228 0000000700000000
Call Trace:
[<ffffffff8006ece7>] do_gettimeofday+0x40/0x90
[<ffffffff887484c9>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
[<ffffffff800637de>] io_schedule+0x3f/0x67
[<ffffffff887484d2>] :nfs:nfs_wait_bit_uninterruptible+0x9/0xd -- process waiting for NFS
[<ffffffff80063922>] __wait_on_bit_lock+0x36/0x66
[<ffffffff887484c9>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
[<ffffffff800639be>] out_of_line_wait_on_bit_lock+0x6c/0x78
[<ffffffff800a34d9>] wake_bit_function+0x0/0x23
[<ffffffff8874c038>] :nfs:nfs_commit_set_lock+0x3e/0x49
[<ffffffff8008e823>] enqueue_task+0x41/0x56
[<ffffffff8874d95d>] :nfs:nfs_commit_inode+0x37/0xad
[<ffffffff8008e88e>] __activate_task+0x56/0x6d
[<ffffffff8874315e>] :nfs:nfs_release_page+0x58/0x69
[<ffffffff800ce703>] shrink_inactive_list+0x513/0x8de

4. Checked SAR reports before system was Unresponsive. below are the commands used to find out the value.
[root@Server_issue ~]# sar -f /var/log/sa/sa13
05:40:03 PM all 6.92 0.00 5.19 15.80 0.00 72.09 -- High I/O wait
07:23:32 PM LINUX RESTART
[root@Server_issue ~]#sar -q -f /var/log/sa/sa13
05:40:03 PM 2 721 16.55 7.37 3.14 -- Load avarage was High
[root@Server_issue ~]# sar -b -f /var/log/sa/sa13
02:30:01 PM 1447.09 65.02 1382.06 16762.55 69566.65
03:30:01 PM 1450.74 62.64 1388.10 16739.12 69749.63
04:30:01 PM 1446.52 66.05 1380.46 16779.18 69751.73
05:40:03 PM 985.57 336.69 648.89 98707.05 11227.14 -- block read/write was high, there spikes in this value in certain interval.
[root@Server_issue ~]#sar -n NFS -f /var/log/sa/sa13
05:40:03 PM 735.50 0.00 13.86 703.48 5.35 11.73 -- Number of RPC call made are high. Number of write' RPC calls made per second is high.

5. Found that NFS taking longer time to transfer data.
Could you please check if the time taken for data travel is normal? I’ve tried sending 8KB data packets (which is DB backup policy data transfer rate) which is taking 2.43ms in an average. Which is normally 1ms higher than other servers. I tried this on 2 other servers, reference are given below.

[root@Server_issue ~]# ping -s 8000 NFS_Server.example.com
PING NFS_Server.example.com (NFS_Server.example.com IP) 8000(8028) bytes of data.
8008 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=1 ttl=63 time=2.42 ms
8008 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=2 ttl=63 time=2.42 ms
8008 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=3 ttl=63 time=2.42 ms
8008 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=4 ttl=63 time=2.42 ms
8008 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=5 ttl=63 time=2.48 ms
8008 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=6 ttl=63 time=2.44 ms

--- NFS_Server.example.com ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5003ms
rtt min/avg/max/mdev = 2.420/2.437/2.488/0.046 ms
You have new mail in /var/spool/mail/root
[root@Server_issue ~]#

[root@Server_issue ~]# ping -s 2048 NFS_Server.example.com
PING NFS_Server.example.com (NFS_Server.example.com IP) 2048(2076) bytes of data.
2056 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=1 ttl=63 time=2.32 ms
2056 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=2 ttl=63 time=2.29 ms
2056 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=3 ttl=63 time=2.32 ms
2056 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=4 ttl=63 time=2.30 ms
2056 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=5 ttl=63 time=2.29 ms

--- NFS_Server.example.com ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4000ms
rtt min/avg/max/mdev = 2.297/2.310/2.327/0.013 ms
[root@Server_issue ~]# ping NFS_Server.example.com
PING NFS_Server.example.com (NFS_Server.example.com IP) 56(84) bytes of data.
64 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=1 ttl=63 time=1.94 ms
64 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=2 ttl=63 time=1.92 ms
64 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=3 ttl=63 time=1.94 ms
64 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=4 ttl=63 time=1.93 ms

--- NFS_Server.example.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3003ms
rtt min/avg/max/mdev = 1.928/1.936/1.945/0.044 ms
[root@Server_issue ~]#

Test on other Servers:

[test Server 1:~]# ping test Server 12
PING test Server 12 (test Server 12 IP) 56(84) bytes of data.
64 bytes from test Server 12 (test Server 12 IP): icmp_seq=1 ttl=62 time=0.946 ms
64 bytes from test Server 12 (test Server 12 IP): icmp_seq=2 ttl=62 time=1.51 ms
64 bytes from test Server 12 (test Server 12 IP): icmp_seq=3 ttl=62 time=0.695 ms
64 bytes from test Server 12 (test Server 12 IP): icmp_seq=4 ttl=62 time=0.681 ms
64 bytes from test Server 12 (test Server 12 IP): icmp_seq=5 ttl=62 time=0.704 ms
64 bytes from test Server 12 (test Server 12 IP): icmp_seq=6 ttl=62 time=0.735 ms
64 bytes from test Server 12 (test Server 12 IP): icmp_seq=7 ttl=62 time=0.660 ms

--- test Server 12 ping statistics ---
7 packets transmitted, 7 received, 0% packet loss, time 6004ms
rtt min/avg/max/mdev = 0.660/0.848/1.517/0.288 ms
[test Server 1:~]# ping -s 8000 test Server 12
PING test Server 12 (test Server 12 IP) 8000(8028) bytes of data.
8008 bytes from test Server 12 (test Server 12 IP): icmp_seq=1 ttl=62 time=1.13 ms
8008 bytes from test Server 12 (test Server 12 IP): icmp_seq=2 ttl=62 time=1.41 ms
8008 bytes from test Server 12 (test Server 12 IP): icmp_seq=3 ttl=62 time=1.22 ms
8008 bytes from test Server 12 (test Server 12 IP): icmp_seq=4 ttl=62 time=1.17 ms
8008 bytes from test Server 12 (test Server 12 IP): icmp_seq=5 ttl=62 time=2.06 ms
8008 bytes from test Server 12 (test Server 12 IP): icmp_seq=6 ttl=62 time=1.03 ms

--- test Server 12 ping statistics ---
6 packets transmitted, 6 received, 0% packet loss, time 5001ms
rtt min/avg/max/mdev = 1.031/1.340/2.065/0.346 ms
[test Server 1:~]#

6.The system was hung due to resource crunch. However on further looking into the issue I found that this situation occurs due to Hung Tasks. These tasks are mainly waiting to write in NFS Share, then get into uninterruptible sleep (D). Which is chunking up the system Resources.

Resolution/Suggestion:
1. There is a NFS share on this system (/u02) which is mounted from NFS_Server.example.com:/backup . Is this being called periodically for some purpose?
2. If that is the case is it possible to tune these jobs so that these tasks can be distributed over time to decrease load on certain time frame.

DEVOPSZONES

Recent blogs

System Hung after message INFO: task oracle:2915 blocked for more than 120 seconds.

No comments

Contributors

Popular

Subscribe Us

Please Support this website

Devopszones Page

Recent

Comments

Frequent Topics