Best Practices for Performance in ISA Server 2006

Recommended hardware for an Internet link bandwidth of up to 90 Mbps:

  • Processors/Cores: 2/2
  • Processor type: Xeon Dual Core or AMD Dual Core, 2.0–3.0 GHz
  • Memory: 2 GB
  • Disk space: 10 GB
  • Network interface: 100/1000 Mbps

According to outbound firewall test results, ISA Server running on a single Pentium 4 2.4-GHz processor can provide a throughput of approximately 25 Mbps at 75 percent CPU utilization.
Dual Xeon 2.4-GHz processors can provide a throughput of approximately 45 Mbps (T3) at 75 percent utilization of the CPU.

For large enterprise-scale sites with over 500 users, the situation is more complex. This case requires more elaborate planning, because Internet bandwidth is large enough to shift the performance bottleneck to the system’s CPU resource.

Determining CPU and System Architecture Capacity

  • CPU speed. As in most applications, ISA Server benefits from faster CPUs. However, increasing CPU speed does not ensure a linear increase in performance: because ISA Server makes large and frequent memory accesses, a faster CPU may simply spend more cycles idle while waiting for memory.
  • L2/L3 cache size. Dealing with large amounts of data requires frequent memory access. A larger L2/L3 cache improves performance by reducing the cost of these memory fetches.
  • System architecture. Because ISA Server transfers large data loads between network devices, memory, and the CPU, the system elements around the CPU also have an effect on ISA Server performance. A faster memory front side bus and faster I/O buses improve overall capacity.

 

CPU bottlenecks are characterized by situations in which the \Processor\% Processor Time performance counter is high while the network adapter and disk I/O remain well below capacity. In this case (the ideal CPU-maximized system), adding or upgrading processors is the most direct way to increase capacity.

Determining Memory Capacity

ISA Server memory is used for:

  • Storing network sockets (mostly from the nonpaged pool)
  • Internal data structures
  • Pending request objects

In Web proxy caching scenarios, memory is also used for:

  • Disk cache directory structure
  • Memory caching

The following shows the nonpaged pool limits for a computer with 4,096 MB of physical memory:

Physical memory (MB)              4,096
Minimum nonpaged pool size (MB)   128
Maximum nonpaged pool size (MB)   256

If Web caching is disabled, you must determine if more physical memory is needed by monitoring the memory used by all processes in the system. The following performance counters will assist you: 

  • \Memory\Pages/sec
  • \Memory\Pool Nonpaged Bytes
  • \Memory\Pool Paged Bytes
  • \Process(*)\Working Set

The installation of ISA Server 2006 on Microsoft Virtual Server 2005 R2 is supported. However, because the Windows operating system that hosts Virtual Server cannot be protected by ISA Server running in a virtual machine, ISA Server in a Virtual Server environment should not be used in an edge firewall scenario, and that configuration is not supported. You can use this configuration securely in other scenarios.

Also monitor the \Process(wspsrv)\Virtual Bytes performance counter: values of approximately 1,800,000,000 (1.8 GB) indicate that there may be a problem.

Determining Disk Storage Capacity

ISA Server uses disk storage for:

  • Logging firewall activity
  • Web caching

If both are disabled or if there is no traffic, ISA Server will not perform any disk I/O activity. In a typical setup of ISA Server, logging is enabled and configured to use Microsoft SQL Server™ 2000 Desktop Engine (MSDE) logging. For most deployments, a single disk is enough to serve the maximum logging rate. If Web caching is enabled, disk storage capacity must be planned carefully, as explained in Web Caching in this document.

Usually, the limit is between 100 and 200 accesses per second. The performance counter to use for monitoring the disk access rate is \PhysicalDisk(*)\Disk Transfers/sec. If this limit is reached on a disk for a sustained period of time, you can expect the system to slow down, which you will notice as an increase in response time. The immediate solution to remove this bottleneck is to spread disk accesses across more physical disks.

Another cause for a high disk access rate is hard page faults. For troubleshooting this situation, see Web Caching in this document.

The following table provides an estimate of the transaction rate and log bandwidth for the three Internet link bandwidths.

 

Internet link bandwidth

25 Mbps

45 Mbps

90 Mbps

SQL transactions per second

625

1,125

3,250

SQL transaction bandwidth

2.3 Mbps

4.2 Mbps

12.1 Mbps

For larger bandwidths, the numbers in the preceding table can be extrapolated linearly.
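As a rough illustration of this linear extrapolation, the following Python sketch scales the 90-Mbps column of the preceding table up to a larger link; the function name and the example link speed are illustrative only.

    # Base figures from the 90-Mbps column of the table above.
    BASE_LINK_MBPS = 90.0
    BASE_TPS = 3250.0        # SQL transactions per second at 90 Mbps
    BASE_LOG_MBPS = 12.1     # SQL transaction (log) bandwidth at 90 Mbps

    def estimate_logging_load(link_mbps: float) -> tuple[float, float]:
        """Linearly extrapolate (transactions/sec, log bandwidth in Mbps) for larger links."""
        scale = link_mbps / BASE_LINK_MBPS
        return BASE_TPS * scale, BASE_LOG_MBPS * scale

    tps, log_mbps = estimate_logging_load(180.0)   # hypothetical 180-Mbps Internet link
    print(f"~{tps:.0f} SQL transactions/sec, ~{log_mbps:.1f} Mbps of log bandwidth")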

Hardware Recommendations for Supporting HTTP Traffic

 

Internet link bandwidth   Up to 45 Mbps        Up to 90 Mbps
Processors/Cores          2                    2/2
Processor type            Xeon, 2.0–3.0 GHz    Xeon Dual Core or AMD Dual Core, 2.0–3.0 GHz
Memory                    1 GB                 2 GB
Disk space                5 GB                 10 GB
Network interface         100/1000 Mbps        100/1000 Mbps

Cache object hit ratio is the proportion of objects that are served from the cache out of the total objects that are served by the proxy. Likewise, cache byte hit ratio is the proportion of bytes that are served from the cache out of the total bytes that the proxy serves. Common average values are approximately 35 percent object hit ratio and approximately 20 percent byte hit ratio.
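As a rough illustration of what the byte hit ratio means for the Internet link, the following sketch assumes a hypothetical 45-Mbps client-side HTTP demand and the 20 percent average byte hit ratio quoted above.

    byte_hit_ratio = 0.20          # common average byte hit ratio from the text
    client_demand_mbps = 45.0      # assumed total client-side HTTP demand

    upstream_mbps = client_demand_mbps * (1 - byte_hit_ratio)
    print(f"~{upstream_mbps:.0f} Mbps fetched from the Internet, "
          f"~{client_demand_mbps - upstream_mbps:.0f} Mbps served from the cache")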

Suppose a static object requires X CPU cycles and a dynamic object requires 4X cycles. If 80 out of 100 requests are static, the total number of cycles required for 100 requests is 80X + (100 − 80) × 4X = 160X. Static content therefore accounts for 50 percent of the CPU cycles, and it is this portion that can be served by the ISA Server cache.
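The same arithmetic, written out as a short calculation (the variable names are illustrative):

    X = 1.0                     # relative CPU cost of serving one static object
    static_requests = 80
    dynamic_requests = 20

    total_cycles = static_requests * X + dynamic_requests * 4 * X   # 160X
    static_share = (static_requests * X) / total_cycles             # 0.5

    print(f"total cost = {total_cycles:.0f}X; "
          f"{static_share:.0%} of the cycles go to static, cacheable content")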

Tuning Forward Cache Memory and Disks

Number_of_Disks = (Peak_request_rate × Object_hit_ratio) / 100

For example, if peak request rate is 900 requests per second and object hit ratio is 35 percent, four disks are required.

The number 100 in the preceding formula is empirical: an average-performing physical disk (spinning at up to 10,000 revolutions per minute) can serve about 100 I/O operations per second. A faster disk spinning at 15,000 revolutions per minute can serve 130 to 140 I/O operations per second.
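A small sketch of the disk-count formula, using the empirical I/O rates discussed above (the function name and the 135-IOPS figure for a 15,000-RPM disk are assumptions):

    import math

    def cache_disks_needed(peak_request_rate: float,
                           object_hit_ratio: float,
                           disk_iops: float = 100.0) -> int:
        """Number_of_Disks = (Peak_request_rate x Object_hit_ratio) / disk_iops, rounded up."""
        return math.ceil(peak_request_rate * object_hit_ratio / disk_iops)

    # Example from the text: 900 requests/sec at a 35 percent object hit ratio.
    print(cache_disks_needed(900, 0.35))          # 4 disks (10,000-RPM class)
    print(cache_disks_needed(900, 0.35, 135.0))   # 3 disks if 15,000-RPM disks are used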

  • Pending request objects. The number of pending request objects is proportional to the number of client connections to the ISA Server computer. In most cases, it will be less than 50 percent of client connections. Each pending request requires approximately 15 KB. For 10,000 simultaneous connections, the Web proxy memory working set has no more than 50% × 10,000 × 15 KB = 75 MB allocated for pending request objects. However, in an RPC over HTTP or HTTPS publishing scenario, all connections have a pending request object. Following the previous example, a total of 100% × 10,000 × 15 KB = 150 MB is allocated for pending request objects.
  • Cache directory. The directory contains a 48-byte entry for each cached object, so its size is directly determined by the size of the cache and the average response size. For example, a 50-GB cache holding 7,000,000 objects (approximately 7 KB each on average) requires 48 × 7,000,000 = 336 MB.
  • Memory caching. The purpose of memory caching is to serve requests for popular cached objects directly from memory, reducing disk cache fetches. But because cacheable content is practically unlimited in forward caching, the memory cache size has a limited effect on performance.

By default, the memory cache is 10 percent of total physical memory, and is configurable. In general, we recommend using the default setting unless hard page faults occur. 
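The per-component figures above (15 KB per pending request, 48 bytes per cached object) can be combined into a rough planning sketch; the function name and example inputs are illustrative only.

    def cache_memory_estimate_mb(connections: int,
                                 pending_fraction: float,   # 0.5 typical; 1.0 for RPC over HTTP publishing
                                 cached_objects: int) -> dict:
        """Rough per-component memory estimate, in MB, for a caching ISA Server."""
        pending_mb = connections * pending_fraction * 15_000 / 1_000_000   # ~15 KB per pending request
        directory_mb = cached_objects * 48 / 1_000_000                     # 48 bytes per cached object
        return {"pending_requests_mb": round(pending_mb),
                "cache_directory_mb": round(directory_mb)}

    # Examples from the text: 10,000 connections at 50 percent pending, and 7,000,000 cached objects.
    print(cache_memory_estimate_mb(10_000, 0.5, 7_000_000))
    # -> {'pending_requests_mb': 75, 'cache_directory_mb': 336}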

The following performance counters are useful for monitoring cache memory:

  • \ISA Server Cache\Memory Cache Allocated Space (KB)
  • \ISA Server Cache\Memory URL Retrieve Rate (URL/sec)
  • \ISA Server Cache\Memory Usage Ratio Percent (%)
  • \ISA Server Cache\URLs in Cache
  • \Memory\Pages/sec
  • \Memory\Pool Nonpaged Bytes
  • \Memory\Pool Paged Bytes
  • \Process(WSPSRV)\Working Set
  • \TCP\Established Connections

The combined size of the disk and memory cache should be approximately twice the size of the working set, so that it can hold all cacheable objects and account for fragmentation in disk allocation and for the cache refresh policy. For example, a working set of 500 MB requires a 1,000-MB disk cache and 1,500 MB of memory with the memory cache size set to 66 percent (approximately 1,000 MB of memory cache).

Because most cache fetches are served from the memory cache, the I/O rate on the disk is low. In most cases, a single physical disk is sufficient, without being a bottleneck.

Web Authentication

There are many methods for performing Web authentication, and each has its own performance impact. The following table summarizes the advantages and disadvantages of each method.

Authentication scheme           Strength   When authentication is performed   Overhead per request   Overhead per batch
Basic                           Low        Per time                           Low                    None
Digest                          Medium     Per time/count                     None                   High
NTLM                            Medium     Per connection                     None                   High
NTLMv2                          High       Per connection                     None                   High
Kerberos                        High       Per connection                     None                   Medium
SecurID                         High       Per browser session                None                   Medium
RADIUS per request              High       Per request                        High                   None
RADIUS per time out (default)   Medium     Per time                           Low                    None

Outlook Web Access

When a Web client connects to an Exchange front-end server running Outlook Web Access, it loads the Outlook Web page that contains the user-interface icons and the headers of the messages currently in the mailbox. Subsequently, any operation that the user performs (such as Open, Send, or Move to Folder) generates a new HTTP connection that transfers an average of 10 to 20 kilobytes (KB). Aggregated over many users, Outlook Web Access traffic therefore has a relatively low bits-per-connection value (around 100 kilobits per connection).

RPC over HTTP with Outlook 2003 Cached Exchange Mode

Remote procedure call (RPC) over HTTP is a feature of Microsoft Exchange Server 2003 that enables Outlook 2003 clients to access an Exchange server in the Internal corporate network from the Internet. When connecting to Exchange Server, an Outlook 2003 client working in Cached Exchange Mode typically starts with a synchronization of mailbox content with a local cache file. After the synchronization is complete, intermittent connections occur, in which new messages are transferred. For a knowledge worker using a heavy usage profile, the synchronization operation transfers many bytes of data over a small number of connections, so the overall characteristic bits per connection value is rather high (such as 500 kilobits per connection).

Note:
Each RPC over HTTP client establishes approximately 10 connections, so you should also consider the total number of connections (number of clients × 10) when planning your deployment.
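For planning purposes, the note translates into a simple estimate; the function name is illustrative, and the 15-KB-per-pending-request figure reuses the value from the memory discussion earlier in this document.

    def rpc_over_http_connections(clients: int, connections_per_client: int = 10) -> tuple[int, float]:
        """Total simultaneous connections and the memory (MB) their pending request objects need."""
        total = clients * connections_per_client
        # In RPC over HTTP, every connection has a pending request object (~15 KB each).
        pending_mb = total * 15_000 / 1_000_000
        return total, pending_mb

    conns, mem_mb = rpc_over_http_connections(2_000)   # hypothetical 2,000 Outlook 2003 clients
    print(f"{conns} connections, ~{mem_mb:.0f} MB for pending request objects")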

 

Determining SSL Capacity

To track the number of concurrent connections, monitor the \TCPv4\Connections Active performance counter.

The following table shows the CPU cost, in megacycles per megabit, for SSL bridging by the characteristic kilobits-per-connection value of the traffic.

Kilobits per connection      100 (Outlook Web Access)   200 (Web)   500 (RPC over HTTP)
1 processor, SSL to HTTP     91                         77          69
1 processor, SSL to SSL      120                        96          83
2 processors, SSL to HTTP    128                        104         91
2 processors, SSL to SSL     142                        120         104

For example, a dual processor computer with two Intel 2.4-GHz Pentium 4 processors requires 120 megacycles per megabit, or 120 × 15 = 1,800 megacycles per second for 15 megabits per second, and runs at 1,800 / (2 × 2,400) = 38 percent utilization at peak throughput.
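The same megacycle arithmetic can be generalized; the constants in the example below mirror the text (120 megacycles per megabit, two 2,400-MHz processors), and the function names are illustrative.

    def cpu_utilization(throughput_mbps: float, megacycles_per_megabit: float,
                        processors: int, processor_mhz: float) -> float:
        """Fraction of total CPU consumed at the given throughput."""
        required = throughput_mbps * megacycles_per_megabit   # megacycles per second needed
        available = processors * processor_mhz                # megacycles per second available
        return required / available

    def max_throughput_mbps(megacycles_per_megabit: float, processors: int,
                            processor_mhz: float, max_utilization: float = 0.80) -> float:
        """Throughput supportable at the recommended maximum CPU usage (80 percent)."""
        return processors * processor_mhz * max_utilization / megacycles_per_megabit

    # Example from the text: 15 Mbps of SSL traffic at 120 megacycles/megabit on two 2.4-GHz processors.
    print(f"{cpu_utilization(15, 120, 2, 2400):.0%}")        # ~38% utilization
    print(f"{max_throughput_mbps(120, 2, 2400):.0f} Mbps")   # ~32 Mbps at 80% utilization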

The following table shows the amount of traffic, in megabits per second, that one or two 2.4-GHz processors can process at the maximum recommended CPU usage (80 percent).

Kilobits per connection      100   200   500
1 processor, SSL to HTTP     21    25    28
1 processor, SSL to SSL      16    20    23
2 processors, SSL to HTTP    30    37    42
2 processors, SSL to SSL     27    32    37

 

The following table shows the CPU cost, in megacycles per megabit, for the main Web proxy scenarios.

Scenario       Transparent proxy   Forward proxy   SSL tunneling
1 processor    74                  37              30
2 processors   86                  43              35

There are several ways to scale out an ISA Server system:

  • Using high-end network switching hardware. These switches are often called L3, L4, or L7 switches (layer 3, layer 4, or layer 7) because they provide switching capabilities based on information available at different networking layers. L3 switching is based on packet layer information (IP), L4 is based on transport layer information (TCP), and L7 performs switching based on application data (HTTP headers). The information available at these levels can provide sophisticated load balancing, according to IP source or destination addresses, TCP source or destination ports, URL, and content type. Because the switches are implemented as hardware appliances, they have a relatively high throughput and are highly available and reliable, but also expensive. Most switches can detect server-down conditions, enabling fault tolerance.
  • Using DNS round-robin name resolution. A cluster of servers can be assigned the same name in the Domain Name System (DNS). DNS responds to queries for that name by cycling through the list. This is an inexpensive (no cost) solution, but has drawbacks. One problem is that the load is not necessarily distributed evenly between servers in the cluster. Another problem is that it provides no fault tolerance.
  • Using Windows Network Load Balancing. Network Load Balancing (NLB) works by sharing an IP address with all the servers in a cluster, and all data sent to this IP address is viewed by all servers. However, each packet is served by only one of the servers, according to some shared hash function. NLB is implemented at the operating system level. It provides evenly distributed load balancing and supports fault tolerance. (Other servers in the cluster can detect a failing server and distribute its load between them.) However, it requires CPU processing overhead (approximately 10 to 15 percent for common ISA Server scenarios), and has a limit to the number of members in the cluster (approximately 8 computers as the recommended maximum). For more information about how to deploy NLB, see “Network Load Balancing Integration Concepts for Microsoft Internet Security and Acceleration (ISA) Server 2006” at the Microsoft TechNet Web site.
  • Using Cache Array Routing Protocol. For caching scenarios, ISA Server supports the Cache Array Routing Protocol (CARP), which is a cache load balancing protocol. It not only distributes the load between the servers, it also distributes the cached content. Each request is sent to a specific computer in the cluster, so that subsequent hits are served from that computer (a simplified sketch of this hash-based routing follows this list).
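A minimal sketch of CARP-style hash routing (simplified, not the exact algorithm ISA Server implements): each URL is deterministically mapped to one array member, so repeated requests for the same object always reach the same cache.

    import hashlib

    def carp_member(url: str, members: list[str]) -> str:
        """Pick the member with the highest combined hash score for this URL (highest-random-weight hashing)."""
        def score(member: str) -> int:
            return int.from_bytes(hashlib.md5((member + url).encode()).digest()[:8], "big")
        return max(members, key=score)

    array = ["isa1", "isa2", "isa3"]   # hypothetical array member names
    for url in ("http://example.com/a.gif", "http://example.com/b.css"):
        print(url, "->", carp_member(url, array))

Because the mapping depends only on the URL and the member names, adding or removing one member remaps only the URLs that scored highest on that member, which is what lets CARP redistribute cached content without a full reshuffle.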

 

The following comparison summarizes these scale-out methods:

  • Scale factor: hardware switch, 2; Windows NLB, 1.75 for Web traffic and 1.9 for SSL and VPN remote access; DNS round-robin, 2; CARP, starting from 1.5 and asymptotically approaching 2.
  • System cost: hardware switch, expensive; Windows NLB, DNS round-robin, and CARP, no added cost.
  • Fault tolerance: hardware switch, depends on the switch (most detect a failing computer and load the others); Windows NLB and CARP, by mutual detection of a failing computer; DNS round-robin, none.
  • Scenario: hardware switch, Windows NLB, and DNS round-robin, all scenarios; CARP, forward caching only.

 

The following table shows the CPU cost, in megacycles per megabit, per scenario for different hardware platforms.

Scenario                Single Pentium 4   Dual Xeon   Xeon Dual Processor, Dual Core   AMD Dual Processor, Dual Core
Transparent Web proxy   74                 86          62                               36
Forward Web proxy       37                 43          32                               18
Stateful filtering      8                  10          9                                5
SSL tunneling           30                 35          38                               40

SSL – SSL to HTTP

Scenario             Single Pentium 4   Dual Xeon   Xeon Dual Processor, Dual Core   AMD Dual Processor, Dual Core
Outlook Web Access   91                 128         91                               51
Web                  77                 104         72                               40
RPC over HTTP        69                 91          64                               35

The numbers in the preceding table were obtained using the following assumptions:

  • MSDE logging is used.
  • No Web authentication is performed.
  • HTTP Web filter is enabled with default settings.
  • ISA Server is loaded with characteristic Web traffic.
  • ISA Server hardware is tuned as described in Tuning Hardware for Maximum CPU Utilization in this document.
The next table provides NLB scale factors to be used when applying NLB scale-out for increased capacity.
Number of NLB array members   Factor for SSL and VPN traffic (scale factor 1.9)   Factor for Web traffic (scale factor 1.75)
2                             1.053                                               1.143
3                             1.085                                               1.236
4                             1.108                                               1.306
5                             1.126                                               1.363
6                             1.142                                               1.412
7                             1.155                                               1.455
8                             1.166                                               1.493

These factors multiply the megacycles-per-megabit cost, as in the calculations that follow.

For an example traffic mix, the weighted CPU cost is: megacycles/megabit = 353 × 10% + 128 × 20% + 86 × 35% + 43 × 35% = 107.

The total number of megacycles per second required for 80 megabits per second is 80 × 107 = 8,560.

Factored megacycles/megabit assuming a two-member array = 115% × (353 × 10% × 1.053 + 128 × 20% × 1.053 + 86 × 35% × 1.143 + 49 × 35% × 1.143) = 136. (The 115 percent factor accounts for the NLB CPU overhead.)

Factored megacycles/megabit assuming a three-member array = 115% × (353 × 10% × 1.085 + 128 × 20% × 1.085 + 86 × 35% × 1.236 + 49 × 35% × 1.236) = 143.

The resulting total is 80 × 143 = 11,440 megacycles per second. Three dual processor 3-GHz computers provide 13,500 megacycles per second at 84 percent processor utilization, which is enough to support this load and leaves some room for growth.
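The worked example can be generalized into a small planning sketch; the traffic mix, megacycle costs, and NLB factors are taken from the tables and calculations above, and the function name is illustrative.

    NLB_OVERHEAD = 1.15   # approximately 10 to 15 percent NLB CPU overhead

    def factored_cost(mix, nlb_factors, members):
        """Weighted megacycles/megabit for a traffic mix, adjusted for an NLB array of a given size."""
        total = sum(cost * share * nlb_factors[traffic_type][members]
                    for traffic_type, cost, share in mix)
        return NLB_OVERHEAD * total

    # Traffic mix from the example: (traffic type, megacycles/megabit, share of total traffic).
    mix = [("ssl", 353, 0.10), ("ssl", 128, 0.20), ("web", 86, 0.35), ("web", 49, 0.35)]
    # Per-member megacycle factors from the preceding table (two- and three-member subset).
    nlb_factors = {"ssl": {2: 1.053, 3: 1.085}, "web": {2: 1.143, 3: 1.236}}

    for members in (2, 3):
        cost = factored_cost(mix, nlb_factors, members)
        print(f"{members} members: ~{cost:.0f} megacycles/megabit, "
              f"~{80 * cost:.0f} megacycles/sec needed for 80 Mbps")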

 
