Wednesday, November 5, 2014

VMware Snapshot causing database connection to error out "Connection reset by peer"

Every time my customer takes a VMware snapshot of the VMs that connect to an Oracle database on a physical Linux box, the Oracle database connections are reset. After a few weeks of collecting RDA, OSWatcher, listener.log and other diagnostics, the customer's administrator found that someone had set the net.ipv4.tcp_retries2 parameter to a very low value. The default value is 15, which works out to roughly 13 - 30 minutes before the connection times out. In this case, someone had set tcp_retries2 to 3, which cuts that window to a matter of seconds (the exact figure depends on the retransmission timeout). If the VMware snapshot "stuns" the VM for longer than that, the connection to the database is reset.
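To get a feel for how much tcp_retries2 changes the timeout, here is a rough sketch of the arithmetic. It assumes the stock Linux bounds on the retransmission timeout (200 ms minimum, 120 s maximum) and simply doubles the timeout on each retry. The function name and constants are mine, for illustration only. The result is a lower bound: the real RTO is derived from the measured round-trip time, so actual timeouts are usually longer.

```python
# Assumed Linux defaults for the retransmission timeout (RTO) bounds, in seconds.
TCP_RTO_MIN = 0.2
TCP_RTO_MAX = 120.0


def hypothetical_timeout(retries2, rto_min=TCP_RTO_MIN, rto_max=TCP_RTO_MAX):
    """Lower-bound timeout in seconds for a given tcp_retries2 value.

    The RTO starts at rto_min and doubles after each unacknowledged
    retransmission until it is capped at rto_max.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries2 + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total


for retries in (3, 15):
    print(retries, round(hypothetical_timeout(retries), 1))
# 3 3.0
# 15 924.6
```

So under these assumptions the default of 15 gives at least about 15 minutes, while a value of 3 gives only about 3 seconds, which a VM stun can easily exceed.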

There is absolutely no reason to set tcp_retries2 to such a low number.