-->

DEVOPSZONES

  • Recent blogs

    System Hung after message INFO: task oracle:2915 blocked for more than 120 seconds.

    Error:
    System Hung after message INFO: task oracle:2915 blocked for more than 120 seconds.

    Cause:
    These messages means that the system is experiencing resource congestion and processes are being starved of available resources. But what cause the 
    real problem lets check out.

    Troubleshooting:
    1. Check the hung task configuration parameters.

    [root@Server_issue ~]# sysctl -a | grep -i hung
    kernel.hung_task_warnings = 0
    kernel.hung_task_timeout_secs = 120 -- Number of seconds it waits before declaring a problem, if it is 0 then the hungtask warning is disabled.
    kernel.hung_task_check_count = 4194304
    kernel.hung_task_panic = 0 -- system do not panic on a hung task. if set to 1 then system do panic.
    [root@Server_issue ~]#

    2. Check if hungtaskd process is running.

    [root@Server_issue ~]# ps aux | grep -i hungt
    root       575  0.0  0.0      0     0 ?        S    Jun13   0:00 [khungtaskd]
    root      5665  0.0  0.0  61216   820 pts/2    R+   11:14   0:00 grep -i hungt
    [root@Server_issue ~]#


    3. These are alerts found in dmesg.
    INFO: task oracle:2915 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    oracle        D ffff81000103f1a0     0  2915      1          2917  2913 (NOTLB)
    ffff8104080779d8 0000000000000086 0000000000000003 0000000000000001
    ffff81083816d100 000000000000000a ffff81033b6fb040 ffff81083fc83100
    0000490012f22173 00000000000ea740 ffff81033b6fb228 0000000700000000
    Call Trace:
    [<ffffffff8006ece7>] do_gettimeofday+0x40/0x90
    [<ffffffff887484c9>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
    [<ffffffff800637de>] io_schedule+0x3f/0x67
    [<ffffffff887484d2>] :nfs:nfs_wait_bit_uninterruptible+0x9/0xd -- process waiting for NFS
    [<ffffffff80063922>] __wait_on_bit_lock+0x36/0x66
    [<ffffffff887484c9>] :nfs:nfs_wait_bit_uninterruptible+0x0/0xd
    [<ffffffff800639be>] out_of_line_wait_on_bit_lock+0x6c/0x78
    [<ffffffff800a34d9>] wake_bit_function+0x0/0x23
    [<ffffffff8874c038>] :nfs:nfs_commit_set_lock+0x3e/0x49
    [<ffffffff8008e823>] enqueue_task+0x41/0x56
    [<ffffffff8874d95d>] :nfs:nfs_commit_inode+0x37/0xad
    [<ffffffff8008e88e>] __activate_task+0x56/0x6d
    [<ffffffff8874315e>] :nfs:nfs_release_page+0x58/0x69
    [<ffffffff800ce703>] shrink_inactive_list+0x513/0x8de

    4. Checked SAR reports before system was Unresponsive. below are the commands used to find out the value.
    [root@Server_issue ~]# sar -f /var/log/sa/sa13
    05:40:03 PM       all      6.92      0.00      5.19     15.80      0.00     72.09 -- High I/O wait
    07:23:32 PM       LINUX RESTART
    [root@Server_issue ~]#sar -q -f /var/log/sa/sa13
    05:40:03 PM         2       721     16.55      7.37      3.14 -- Load avarage was High
    [root@Server_issue ~]# sar -b -f /var/log/sa/sa13
    02:30:01 PM   1447.09     65.02   1382.06  16762.55  69566.65
    03:30:01 PM   1450.74     62.64   1388.10  16739.12  69749.63
    04:30:01 PM   1446.52     66.05   1380.46  16779.18  69751.73
    05:40:03 PM    985.57    336.69    648.89  98707.05  11227.14 -- block read/write was high, there spikes in this value in certain interval.
    [root@Server_issue ~]#sar -n NFS -f /var/log/sa/sa13
    05:40:03 PM    735.50      0.00     13.86    703.48      5.35     11.73 -- Number of RPC call made are high. Number of write' RPC calls made per second is high.

    5. Found that NFS taking longer time to transfer data.
    Could you please check if the time taken for data travel is normal? I’ve tried sending 8KB data packets (which is DB backup policy data transfer rate) which is taking 2.43ms in an average. Which is normally 1ms higher than other servers. I tried this on 2 other servers, reference are given below.


    [root@Server_issue ~]# ping -s 8000 NFS_Server.example.com
    PING NFS_Server.example.com (NFS_Server.example.com IP) 8000(8028) bytes of data.
    8008 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=1 ttl=63 time=2.42 ms
    8008 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=2 ttl=63 time=2.42 ms
    8008 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=3 ttl=63 time=2.42 ms
    8008 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=4 ttl=63 time=2.42 ms
    8008 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=5 ttl=63 time=2.48 ms
    8008 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=6 ttl=63 time=2.44 ms

    --- NFS_Server.example.com ping statistics ---
    6 packets transmitted, 6 received, 0% packet loss, time 5003ms
    rtt min/avg/max/mdev = 2.420/2.437/2.488/0.046 ms
    You have new mail in /var/spool/mail/root
    [root@Server_issue ~]#

    [root@Server_issue ~]# ping -s 2048 NFS_Server.example.com
    PING NFS_Server.example.com (NFS_Server.example.com IP) 2048(2076) bytes of data.
    2056 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=1 ttl=63 time=2.32 ms
    2056 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=2 ttl=63 time=2.29 ms
    2056 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=3 ttl=63 time=2.32 ms
    2056 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=4 ttl=63 time=2.30 ms
    2056 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=5 ttl=63 time=2.29 ms

    --- NFS_Server.example.com ping statistics ---
    5 packets transmitted, 5 received, 0% packet loss, time 4000ms
    rtt min/avg/max/mdev = 2.297/2.310/2.327/0.013 ms
    [root@Server_issue ~]# ping NFS_Server.example.com
    PING NFS_Server.example.com (NFS_Server.example.com IP) 56(84) bytes of data.
    64 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=1 ttl=63 time=1.94 ms
    64 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=2 ttl=63 time=1.92 ms
    64 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=3 ttl=63 time=1.94 ms
    64 bytes from NFS_Server.example.com (NFS_Server.example.com IP): icmp_seq=4 ttl=63 time=1.93 ms

    --- NFS_Server.example.com ping statistics ---
    4 packets transmitted, 4 received, 0% packet loss, time 3003ms
    rtt min/avg/max/mdev = 1.928/1.936/1.945/0.044 ms
    [root@Server_issue ~]#


    Test on other Servers:

    [test Server 1:~]# ping test Server 12
    PING test Server 12 (test Server 12 IP) 56(84) bytes of data.
    64 bytes from test Server 12 (test Server 12 IP): icmp_seq=1 ttl=62 time=0.946 ms
    64 bytes from test Server 12 (test Server 12 IP): icmp_seq=2 ttl=62 time=1.51 ms
    64 bytes from test Server 12 (test Server 12 IP): icmp_seq=3 ttl=62 time=0.695 ms
    64 bytes from test Server 12 (test Server 12 IP): icmp_seq=4 ttl=62 time=0.681 ms
    64 bytes from test Server 12 (test Server 12 IP): icmp_seq=5 ttl=62 time=0.704 ms
    64 bytes from test Server 12 (test Server 12 IP): icmp_seq=6 ttl=62 time=0.735 ms
    64 bytes from test Server 12 (test Server 12 IP): icmp_seq=7 ttl=62 time=0.660 ms

    --- test Server 12 ping statistics ---
    7 packets transmitted, 7 received, 0% packet loss, time 6004ms
    rtt min/avg/max/mdev = 0.660/0.848/1.517/0.288 ms
    [test Server 1:~]# ping -s 8000 test Server 12
    PING test Server 12 (test Server 12 IP) 8000(8028) bytes of data.
    8008 bytes from test Server 12 (test Server 12 IP): icmp_seq=1 ttl=62 time=1.13 ms
    8008 bytes from test Server 12 (test Server 12 IP): icmp_seq=2 ttl=62 time=1.41 ms
    8008 bytes from test Server 12 (test Server 12 IP): icmp_seq=3 ttl=62 time=1.22 ms
    8008 bytes from test Server 12 (test Server 12 IP): icmp_seq=4 ttl=62 time=1.17 ms
    8008 bytes from test Server 12 (test Server 12 IP): icmp_seq=5 ttl=62 time=2.06 ms
    8008 bytes from test Server 12 (test Server 12 IP): icmp_seq=6 ttl=62 time=1.03 ms

    --- test Server 12 ping statistics ---
    6 packets transmitted, 6 received, 0% packet loss, time 5001ms
    rtt min/avg/max/mdev = 1.031/1.340/2.065/0.346 ms
    [test Server 1:~]#

    6.The system was hung due to resource crunch. However on further looking into the issue I found that this situation occurs due to Hung Tasks. These tasks are mainly waiting to write in NFS Share, then get into uninterruptible  sleep (D). Which is chunking up the system Resources.

    Resolution/Suggestion:
    1.   There is a NFS share on this system (/u02) which is mounted from NFS_Server.example.com:/backup . Is this being called periodically for some purpose?
    2.   If that is the case is it possible to tune these jobs so that these tasks can be distributed over time to decrease load on certain time frame.

    No comments