High Availability Developers Guide
Introduction
Newtek Gateway offers a variety of tools and strategies that give merchants as close to 100% continuous payment processing as possible. This guide gives developers the background information and examples needed to leverage these tools and maximize the availability of payment processing.
System Architecture
Newtek Gateway operates multiple data centers across the country. Each data center is equipped with redundant power backup and generation, cooling, fire suppression, and network connections. If a data center loses grid power, UPS and on-site generators ensure continuous power. The data centers are geographically dispersed to prevent downtime due to natural disasters. The data centers are connected to multiple Internet backbone providers to mitigate peering issues between providers. They house all resources necessary to process payments, independent of any other location. During normal operation, all data is replicated between locations and all locations can utilize resources such as platform connections out of other locations. In the event of a major outage such as complete power loss in one location, the other locations can operate completely independently.
Within each data center, servers are operated in high-availability (HA) clusters. During normal operation, traffic is load balanced across all servers in the cluster to ensure the highest possible performance. The cluster continuously monitors each server for availability. If a server goes offline for any reason (planned maintenance, hardware failure, software misconfiguration), the cluster automatically stops routing traffic to that server. Whenever possible, servers are engineered with redundant components such as multiple power supplies and hard drives in RAID configurations.
Server and data center maintenance is always done in a non-invasive manner that does not affect transaction processing.
URLs
Default: secure.newtekgateway.com
Currently the default url is set up to direct traffic to the primary processing location '03'. In the event of an outage at '03', DNS is updated to route these urls to primary location '01' or '02'. The difference between secure.newtekgateway.com and secure.newtekgateway.com is the type of SSL certificate used. secure.newtekgateway.com uses an "unchained" 2-year Verisign certificate. This certificate should work with the widest range of SSL libraries, including those that do not support chained certificates. secure.newtekgateway.com uses an extended validation (EV) SSL certificate. This provides the green bar in modern web browsers but causes some issues with certificate validation in older SSL libraries.
Primary Processing Location: secure.newtekgateway.com
Use | URL |
---|---|
Login | https://secure.newtekgateway.com/login |
Transaction API | https://secure.newtekgateway.com/gate |
SOAP API | https://secure.newtekgateway.com/soap/gate |
Ping | https://secure.newtekgateway.com/ping |
This url sends traffic directly to the primary processing location '01'. It is recommended that this is the first backup url that developers try. This location has all resources necessary to operate independently of the other primary location.
Primary Processing Location: secure-02.newtekgateway.com
Use | URL |
---|---|
Login | https://secure-02.newtekgateway.com/login |
Transaction API | https://secure-02.newtekgateway.com/gate |
SOAP API | https://secure-02.newtekgateway.com/soap/gate |
Ping | https://secure-02.newtekgateway.com/ping |
This url sends traffic directly to the primary processing location '02'. It is recommended that this is the second backup url that developers try. This location has all resources necessary to operate independently of the other primary location.
Primary Processing Location: secure.newtekgateway.com
Use | URL |
---|---|
Login | https://secure.newtekgateway.com/login |
Transaction API | https://secure.newtekgateway.com/gate |
SOAP API | https://secure.newtekgateway.com/soap/gate |
Ping | https://secure.newtekgateway.com/ping |
This url sends traffic directly to the primary processing location '03'. During normal operation, this url is identical to the default secure.newtekgateway.com url.
Testing Connectivity
Newtek Gateway does not allow 'pinging' its servers. ICMP Echo requests (ping) are dropped at the edge firewalls to prevent network probing and simplistic DDoS attacks. Ping requests sent to any of our servers will time out. A ping timeout does not mean that the server is unavailable. The following example is the normal expected output:
PING secure.newtekgateway.com (64.0.146.104): 56 data bytes
--- secure.newtekgateway.com ping statistics ---
10 packets transmitted, 0 packets received, 100% packet loss
To test your ability to connect to a given url, access the ping urls listed above. They will respond with a state and cluster id. As long as the response you receive starts with the string "UP" then the url is available for use:
# curl https://secure.newtekgateway.com/ping
UP:ca403
The string "DOWN" will appear if the datacenter is not recommended for use. The majority of the time the url will still be able to accept transactions. The DOWN flag is used to indicate planned maintenance where there is the potential for a disruption of service.
The second part of the string indicates which cluster in which datacenter you are connecting to. For example "ca403" means that you were routed to cluster 3 in datacenter ca4.
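The response format described above can be parsed with a few lines of PHP. This is a minimal sketch, not an official client: the function name `parsePingResponse` is hypothetical, and the split of the id into a three-character datacenter code followed by a numeric cluster number is an assumption based only on the single "ca403" example.

```php
<?php
// Parse a ping-url response such as "UP:ca403".
// Assumption: the id is a three-character datacenter code followed by a
// numeric cluster number, as in the "ca403" example.
function parsePingResponse(string $body): array
{
    $body = trim($body);
    $result = [
        // The url is usable as long as the response starts with "UP".
        'up'         => strncmp($body, 'UP', 2) === 0,
        'datacenter' => null,
        'cluster'    => null,
    ];
    if (preg_match('/^[A-Z]+:([a-z0-9]{3})(\d+)$/', $body, $m)) {
        $result['datacenter'] = $m[1];
        $result['cluster']    = (int) $m[2];
    }
    return $result;
}

$status = parsePingResponse("UP:ca403\n");
// $status['up'] is true, $status['datacenter'] is 'ca4', $status['cluster'] is 3
```

If the id does not match the assumed layout, the datacenter and cluster fields stay null while the up flag is still set from the prefix.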
Firewall Rules
If your network uses outbound firewall rules to restrict which IPs can be connected to, you should create outbound rules for all IPs that might be used. Only ports 80, 443, and 4443 are used by Newtek Gateway.
Host | IPs |
---|---|
secure.newtekgateway.com | 64.0.146.104, 209.220.191.104, 209.239.233.104, 65.132.197.104 |
secure.newtekgateway.com | 64.0.146.8, 209.220.191.8, 209.239.233.8, 65.132.197.8 |
newtekgateway.com | 64.0.146.100, 209.220.191.100, 209.239.233.100, 65.132.197.100 |
secure.newtekgateway.com | 64.0.146.9, 209.239.233.105 |
secure-02.newtekgateway.com | 209.220.191.104, 64.0.146.1046 |
secure.newtekgateway.com | 209.239.233.9 |
secure.newtekgateway.com | 65.132.197.104 |
sandbox.newtekgateway.com | 64.0.146.209, 209.239.233.129 |
Redundancy Strategies
Passive, DNS Failover
This "strategy" is actually the default behavior of the primary secure.newtekgateway.com and secure.newtekgateway.com urls. If your server/workstation is using DNS to resolve the default URLs, you will receive the "default" datacenter (currently CA4 - 209.239.233.*). If there is maintenance or an outage, our DNS servers will automatically start resolving to one of the other two datacenters. Currently we are configured to automatically fail over within 15 seconds of a failure. The time to live (TTL) on our DNS records is set to 3 minutes. Unfortunately, your ISP's DNS servers (or even your server or browser) may cache the old DNS entry for longer, leading to longer failover times for some users. In these cases it might be necessary to manually override your DNS server with a hosts file entry. On a UNIX server this can be done by editing /etc/hosts, and on Windows by editing C:\Windows\System32\drivers\etc\hosts (or equivalent). To force secure.newtekgateway.com to go to primary '02', you would add the line:
209.220.191.104 secure.newtekgateway.com
To "implement" this strategy, merchants and customers simply need to retry the transaction every few minutes until the DNS has been updated. The only coding change that developers might consider is setting the timeout variable properly and catching the timeout error:
.NET DLL
Try
    newtek.Timeout = 60
    newtek.Sale()
Catch ex As Exception
    If ex.Message = "Error writing to the gateway: Unable to connect to the remote server" Then
        errormessage = "Unable to process payment, please try again in a few minutes"
    Else
        errormessage = ex.Message
    End If
End Try
PHP Library
if($tran->Process())
{
    echo "<b>Card approved</b><br>";
    echo "<b>Authcode:</b> " . $tran->authcode . "<br>";
    echo "<b>AVS Result:</b> " . $tran->avs_result . "<br>";
    echo "<b>Cvv2 Result:</b> " . $tran->cvv2_result . "<br>";
} else if($tran->curlerror == 'connect() timed out!') {
    // the connection itself failed, so the gateway never saw the transaction
    echo "<b>Unable to process payment, please try again in a few minutes</b><br>";
} else {
    echo "<b>Card Declined</b> (" . $tran->result . ")<br>";
    echo "<b>Reason:</b> " . $tran->error . "<br>";
    if($tran->curlerror) echo "<b>Curl Error:</b> " . $tran->curlerror . "<br>";
}
Active Failover, Backup URL Retry
Using this strategy, any failures to connect to the primary url are automatically retried on a secondary url. This method is useful in applications that do not have reliable network connections such as mobile internet solutions. In the event that the initial connection to the gateway fails, the developer will trap the error and automatically retry again on the second url. This process can be repeated for all urls if the developer chooses to.
If choosing this strategy, it is important to consider the duplicate transaction problem. Many developers set the connection timeout too low and end up giving up before the gateway has finished processing. While it is rare, some processing backends can take as much as 120 seconds to respond. For example, suppose a developer has their timeout set to 30 seconds and the gateway takes 45 seconds to complete an approval. The application would return a timeout error even though the gateway approved the transaction and placed it in the batch. The application then retries the transaction on the backup url, where it is again approved and placed in the batch. There are now two transactions on the gateway even though the application has only recorded one. While the obvious solution is to raise the application timeout, this can lead to customers giving up and retrying the transaction on their own.
There are two ways to deal with this problem. The first, and easier, method is to use the duplicate folding functionality. Duplicate folding will check all incoming transactions for duplicates. If a duplicate is detected, the original transaction response details will be returned instead of processing the transaction again. In the scenario where the first transaction times out (but is authorized on the gateway) and is then retried on the backup url, the second call to the backup url will detect the duplicate and return the details that would have been returned on the first call if the connection hadn't been dropped prematurely.
When using this method, it's important to make sure that intentional duplicate charges are not accidentally folded, for example when a customer decides to buy the same product for the same amount on the same card.
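The retry loop for this strategy might look like the following. It is a sketch under stated assumptions rather than a definitive implementation: `processWithFailover` and the `$send` callable are hypothetical names, and `$send` is assumed to return an array with an `ok` flag and a `connect_error` string that is empty whenever the gateway actually answered.

```php
<?php
// Try each gateway url in order, moving on only when the connection itself failed.
// $send is a hypothetical callable: it takes a url and returns
// ['ok' => bool, 'connect_error' => string] ('' when the gateway answered).
function processWithFailover(array $urls, callable $send): array
{
    $attempts = [];
    foreach ($urls as $url) {
        $attempts[] = $url;
        $result = $send($url);
        // An approval or a decline both mean the gateway answered: stop here.
        if ($result['ok'] || $result['connect_error'] === '') {
            return $result + ['url' => $url, 'attempts' => $attempts];
        }
    }
    // Every url failed at the connection level.
    return [
        'ok'            => false,
        'connect_error' => 'all gateway urls failed',
        'url'           => null,
        'attempts'      => $attempts,
    ];
}
```

With the PHP library, `$send` could set `$tran->gatewayurl` to the given url, call `$tran->Process()`, and report `$tran->curlerror` as the connection error. A decline is never retried, since only connection-level failures advance to the next url. Because a timeout can still hide an approved transaction, this loop is best combined with a generous timeout and duplicate folding.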
Connection Scoreboard for Load Balancing and Failover
Another strategy that works particularly well for high-traffic, multi-threaded applications is to maintain a connection scoreboard. The scoreboard keeps track of the number of open connections, hits (successful transactions), and errors for each url. This data is then used to select the best url to send the next transaction to. During normal operation, transactions will load balance between the primary urls. During an outage, after the first failure is recorded, all other transactions will automatically route to the other primary urls.
Example scoreboard:
URL | Working | Hits | Errors | Last Error |
---|---|---|---|---|
secure.newtekgateway.com | 0 | 100 | 0 | |
secure-02.newtekgateway.com | 1 | 100 | 0 | |
In the above example, both links have successfully processed 100 transactions and we are currently in the middle of processing a transaction on the second url (www-02). The logic for selecting the next url is to pick the one with the fewest errors, then the lowest working count, then the fewest hits. In the above example we would select the first url (www-01) as the next url, since it is currently idle (working=0) and has the same number of errors (0) as www-02.
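The selection rule can be sketched against an in-memory scoreboard before bringing a database into the picture. The array layout and the function name `selectNextUrl` are assumptions chosen to mirror the table columns. The example uses the short names www-01 and www-02 from the text.

```php
<?php
// Pick the next url from an in-memory scoreboard: fewest errors first,
// then fewest in-flight transactions (working), then fewest hits.
// The row layout is an assumption that mirrors the example table columns.
function selectNextUrl(array $scoreboard): ?string
{
    if ($scoreboard === []) {
        return null;
    }
    usort($scoreboard, function (array $a, array $b): int {
        // Arrays of equal length compare element by element, in order.
        return [$a['errors'], $a['working'], $a['hits']]
           <=> [$b['errors'], $b['working'], $b['hits']];
    });
    return $scoreboard[0]['url'];
}

$scoreboard = [
    ['url' => 'www-01', 'working' => 0, 'hits' => 100, 'errors' => 0],
    ['url' => 'www-02', 'working' => 1, 'hits' => 100, 'errors' => 0],
];
echo selectNextUrl($scoreboard), "\n"; // www-01
```

Sorting on the tuple (errors, working, hits) keeps the priority order explicit, so a url with any recorded error is skipped no matter how idle it is.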
Once an error occurs on one of the connections, the error counter will be increased:
URL | Working | Hits | Errors | Last Error |
---|---|---|---|---|
secure.newtekgateway.com | 0 | 101 | 1 | 2010-01-01 11:12:59 |
secure-02.newtekgateway.com | 1 | 2310 | 0 | |
Since we are sorting first by error count, the next url will now be www-02 because www-01 has a higher error count. Traffic will continue to go to www-02 until the error count is cleared. For this reason, error counts should be cleared periodically. This can be done by resetting a url's error count to 0 once its last error is older than a certain amount of time (e.g., 60 minutes).
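Clearing stale error counts can be sketched the same way. The function name `clearStaleErrors` is hypothetical, and the sketch assumes each scoreboard row stores `lasterror` as a Unix timestamp, or null when no error has been recorded.

```php
<?php
// Reset error counts whose last error is older than $maxAgeSeconds.
// Assumption: each row stores 'lasterror' as a Unix timestamp, or null.
function clearStaleErrors(array $scoreboard, int $now, int $maxAgeSeconds = 3600): array
{
    foreach ($scoreboard as &$row) {
        if ($row['lasterror'] !== null && $now - $row['lasterror'] > $maxAgeSeconds) {
            $row['errors']    = 0;
            $row['lasterror'] = null;
        }
    }
    unset($row); // break the reference left behind by foreach
    return $scoreboard;
}
```

Passing the current time in as `$now` (for example `time()`) keeps the function deterministic and easy to test. The default of 3600 seconds matches the 60-minute example above.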
MySQL/PHP Example
The following is a "proof of concept" using PHP and a MySQL database. The same thing should be possible in any language as long as you have the ability to share information between threads, sessions, application instances, users, etc.
SQL Schema:
CREATE TABLE connections (
    url CHAR(6),
    working INT,
    hits INT,
    errors INT,
    lasterror DATETIME,
    UNIQUE KEY (url)
);
INSERT INTO connections SET url='www-01', working=0, hits=0, errors=0;
INSERT INTO connections SET url='www-02', working=0, hits=0, errors=0;
INSERT INTO connections SET url='www-03', working=0, hits=0, errors=0;
PHP Transaction Library
// $db is assumed to be an open mysqli connection (for example from mysqli_connect())
// select url
$res = mysqli_query($db, "SELECT url
    FROM connections
    ORDER BY errors,working,hits
    LIMIT 1");
$row = $res ? mysqli_fetch_row($res) : null;
// in case something is wrong with the mysql table
$url = $row ? $row[0] : 'www-01';
$safeurl = mysqli_real_escape_string($db, $url);
// update scoreboard to reflect that we are processing a transaction on this link
mysqli_query($db, "UPDATE connections
    SET working=working+1
    WHERE url='" . $safeurl . "'");
$tran->gatewayurl = 'https://' . $url . '.newtekgateway.com/gate';
$res = $tran->Process();
// log error, modify this statement to adjust what you consider a failure
// as is, this considers anything that causes an underlying http error to
// be a gateway failure.
if(!$res && strlen($tran->curlerror)>0)
{
    mysqli_query($db, "UPDATE connections
        SET working=working-1, errors=errors+1, lasterror=now()
        WHERE url='" . $safeurl . "'");
}
// else log success
else {
    mysqli_query($db, "UPDATE connections
        SET working=working-1, hits=hits+1
        WHERE url='" . $safeurl . "'");
    // automatically clear stale error counts (optional)
    // comparing against the database clock avoids PHP/MySQL timezone mismatches
    mysqli_query($db, "UPDATE connections
        SET errors=0, lasterror=null
        WHERE lasterror < NOW() - INTERVAL 30 MINUTE");
}
Pro-Active Failover, URL Monitoring
Using this strategy, the developer keeps a list of processing urls in a database or config file. Each url is then checked every few minutes by requesting its ping url (not by ICMP ping, which is dropped as described under Testing Connectivity). If a check fails, the url is marked as down or otherwise removed from the list. The payment application is then coded to pull its active gateway url from the database or config file.
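A monitoring job along these lines could be run from cron every few minutes. This is a rough sketch: `findActiveUrls`, `fetchWithCurl`, and the `$fetch` callable are hypothetical names, and the 5 and 10 second timeouts are illustrative values rather than gateway recommendations.

```php
<?php
// Check each host's /ping url and return the hosts that report "UP".
// $fetch is a hypothetical callable that returns the response body as a
// string, or false when the request failed.
function findActiveUrls(array $hosts, callable $fetch): array
{
    $active = [];
    foreach ($hosts as $host) {
        $body = $fetch('https://' . $host . '/ping');
        if (is_string($body) && strncmp(trim($body), 'UP', 2) === 0) {
            $active[] = $host;
        }
    }
    return $active;
}

// One possible $fetch built on the curl extension, with short timeouts so
// an unreachable location does not stall the monitor.
function fetchWithCurl(string $url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    return curl_exec($ch); // false on any connection or timeout error
}
```

Calling `findActiveUrls(['secure.newtekgateway.com', 'secure-02.newtekgateway.com'], 'fetchWithCurl')` returns the hosts that currently answer "UP". The job can then write that list to the database or config file that the payment application reads. Hosts reporting "DOWN" are left out, which matches the guidance that such a location is not recommended for use.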
Notifications
Newtek Gateway provides real-time notification of network issues via our Twitter feed.