Monday, August 31, 2015

SEDA: An Architecture for Highly Concurrent Server Applications



https://en.wikipedia.org/wiki/Staged_event-driven_architecture
The staged event-driven architecture (SEDA) refers to an approach to software architecture that decomposes a complex, event-driven application into a set of stages connected by queues. It avoids the high overhead associated with thread-based concurrency models, and decouples event and thread scheduling from application logic. By performing admission control on each event queue, the service can be well-conditioned to load, preventing resources from being overcommitted when demand exceeds service capacity.
SEDA employs dynamic control to automatically tune runtime parameters (such as the scheduling parameters of each stage) as well as to manage load (like performing adaptive load shedding). Decomposing services into a set of stages also enables modularity and code reuse, as well as the development of debugging tools for complex event-driven applications.
http://www.infoq.com/articles/SEDA-Mule
http://berb.github.io/diploma-thesis/original/042_serverarch.html#seda
As a basic concept, it divides the server logic into a series of well-defined stages that are connected by queues. Requests are passed from stage to stage during processing. Each stage is backed by a thread or a thread pool that may be configured dynamically.


The separation favors modularity, as the pipeline of stages can be changed and extended easily. Another very important feature of the SEDA design is resource awareness and explicit control of load. The number of enqueued items per stage and the workload of the thread pool per stage give explicit insight into the overall load factor. In an overload situation, a server can adjust scheduling parameters or thread pool sizes. Other adaptive strategies include dynamic reconfiguration of the pipeline or deliberate request termination. When resource management, load introspection and adaptivity are decoupled from the application logic of a stage, it is simple to develop well-conditioned services.

From a concurrency perspective, SEDA represents a hybrid approach between thread-per-connection multithreading and event-based concurrency. Having a thread (or a thread pool) dequeuing and processing elements resembles an event-driven approach. The usage of multiple stages with independent threads effectively utilizes multiple CPUs or cores and resembles a multi-threaded environment. From a developer's perspective, the implementation of handler code for a certain stage also resembles more traditional thread programming.
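To make the stage/queue idea concrete, here is a minimal SEDA-style stage sketched in Java. The class and method names are illustrative only (not from any SEDA framework), and a real implementation would add admission control policies and dynamic tuning of the pool size.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

// Illustrative sketch: each stage owns a bounded event queue and a thread pool
// that drains the queue and forwards results to the next stage's queue.
class Stage<I, O> {
    private final BlockingQueue<I> queue = new LinkedBlockingQueue<>(1000); // bounded queue = admission control hook
    private final ExecutorService pool;
    private final Function<I, O> handler;
    private final Stage<O, ?> next; // next stage in the pipeline, or null for the last stage

    Stage(int threads, Function<I, O> handler, Stage<O, ?> next) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.handler = handler;
        this.next = next;
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        I event = queue.take();                   // block until an event arrives
                        O result = this.handler.apply(event);     // the stage's application logic
                        if (this.next != null) this.next.enqueue(result);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
    }

    // offer() instead of put(): a full queue rejects the event, which is where load shedding plugs in.
    boolean enqueue(I event) {
        return queue.offer(event);
    }
}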
The main drawback of SEDA is increased latency due to queue and stage traversal, even under minimal load. In a later retrospective [Wel10], Welsh also criticized the missing differentiation of module boundaries (stages) and concurrency boundaries (queues and threads). This distribution triggers too many context switches when a request passes through multiple stages and queues. A better solution groups multiple stages together with a common thread pool. This decreases context switches and improves response times. Stages with I/O operations and comparatively long execution times can still be isolated.
The SEDA model has inspired several implementations, including the generic server framework Apache MINA and enterprise service buses such as Mule ESB.

http://stackoverflow.com/questions/3570610/what-is-seda-staged-event-driven-architecture
A stage is analogous to an "event"; to simplify the idea, think of SEDA as a series of events sending messages between them.
One reason to use this kind of architecture, I think, is that you fragment the logic into decoupled events that can be connected; it fits well mainly for high-performance services with low-latency requirements.
If you use a Java ThreadPoolExecutor (TPE) per stage, you can monitor the health, throughput, errors, and latency of each stage, and quickly find where the performance bottleneck is (see the monitoring sketch below). And as a nice side effect, with smaller pieces of code, you can easily test them and increase your code coverage (that was my case).
For the record, this is the internal architecture of Cassandra (NoSQL), and Mule ESB (AFAIK).
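As the answer above suggests, if each stage is backed by a java.util.concurrent.ThreadPoolExecutor, the standard getters already expose per-stage health numbers. A rough, self-contained sketch (the pool sizes and sampling interval are arbitrary):

import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class StageMonitor {
    public static void main(String[] args) {
        // One stage's executor: 4 core threads, 8 max, fed by a bounded queue.
        ThreadPoolExecutor stagePool = new ThreadPoolExecutor(
                4, 8, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(10_000));

        // Periodically sample the executor; a growing queue with saturated threads marks the bottleneck stage.
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() ->
                System.out.printf("queued=%d active=%d completed=%d%n",
                        stagePool.getQueue().size(),         // backlog waiting in this stage's queue
                        stagePool.getActiveCount(),          // threads currently running handler code
                        stagePool.getCompletedTaskCount()),  // rough throughput counter
                0, 10, TimeUnit.SECONDS);
    }
}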

https://www.quora.com/Design-Patterns/What-is-Staged-Event-Driven-Architecture-SEDA
SEDA seems to be dead.

A modern take on this, avoiding the infamous "one thread per socket" architecture which produces a lot of throttling, are the node.js and vert.x architectures, which are based on callbacks and internal OS mechanisms that happen to be queues too.

http://matt-welsh.blogspot.com/2010/07/retrospective-on-seda.html
http://www.slideshare.net/planetcassandra/cassandra-summit-2014-monitor-everything

DynamoDB Internal



A good example of choosing a hash-and-range primary key for the Person table would be birth year as the hash key and SSN as the range key.

If we don't create secondary indexes, the only option to get the item for a certain non-primary key attribute is to scan the complete table.

You need to create secondary indexes at the time of table creation itself; you cannot add indexes to already existing tables. Also, DynamoDB does not allow editing or deleting indexes on a given table.

Local secondary index
Local secondary indexes give you range-query options beyond your table's range key attribute. To define it simply, a local secondary index is an index that has the same hash key as the table but a different range key. (A creation sketch in Java follows the points below.)

A local secondary index must have both hash and range keys.
The hash key of the local secondary index is the same as that of the table.
The local secondary index allows you to query data from a single partition only, which is specified by a hash key. As we know DynamoDB creates partitions for unique hash values, the local secondary index allows us to query non-key attributes for a specific hash key.
The local secondary index supports both eventual and strong consistency, so while querying data, you can choose whichever is suitable.
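A sketch of declaring a local secondary index with the Java SDK's CreateTableRequest (the v1-era API that was current when these notes were written); the table, attribute, and index names are made up for illustration:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeDefinition;
import com.amazonaws.services.dynamodbv2.model.CreateTableRequest;
import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
import com.amazonaws.services.dynamodbv2.model.KeyType;
import com.amazonaws.services.dynamodbv2.model.LocalSecondaryIndex;
import com.amazonaws.services.dynamodbv2.model.Projection;
import com.amazonaws.services.dynamodbv2.model.ProjectionType;
import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
import com.amazonaws.services.dynamodbv2.model.ScalarAttributeType;

public class CreateOrderHistoryTable {
    public static void main(String[] args) {
        // Table: hash key customerId, range key orderDate.
        // LSI: same hash key (customerId) but an alternate range key (destinationCity).
        CreateTableRequest request = new CreateTableRequest()
            .withTableName("OrderHistory")
            .withAttributeDefinitions(
                new AttributeDefinition("customerId", ScalarAttributeType.S),
                new AttributeDefinition("orderDate", ScalarAttributeType.S),
                new AttributeDefinition("destinationCity", ScalarAttributeType.S))
            .withKeySchema(
                new KeySchemaElement("customerId", KeyType.HASH),
                new KeySchemaElement("orderDate", KeyType.RANGE))
            .withLocalSecondaryIndexes(new LocalSecondaryIndex()
                .withIndexName("city-index")
                .withKeySchema(
                    new KeySchemaElement("customerId", KeyType.HASH),        // must match the table hash key
                    new KeySchemaElement("destinationCity", KeyType.RANGE))  // the alternate range key
                .withProjection(new Projection().withProjectionType(ProjectionType.KEYS_ONLY)))
            .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L));   // shared by the table and its LSIs

        new AmazonDynamoDBClient().createTable(request);
    }
}

Querying the index is then an ordinary Query that names the index and supplies the same hash key.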

Global secondary index
The name global suggests that queries search across all table partitions, compared to a single partition in the case of the local secondary index. Here, we can create a new hash key and an optional range key, different from the table's hash and range keys, to get the index working. (A sketch follows the points below.)

The global secondary index should have a hash key and an optional range key.
The hash and range keys of a global secondary index are different from table hash and range keys.
The global secondary index allows you to query data across the table. It does not restrict its search for a single data partition; hence, the name global.
The global secondary index supports only eventually consistent reads.
The global secondary index maintains its separate read and write capacity units, and it does not take read and write capacity units from the table capacity units.
Unlike the local secondary index, global ones do not have any size limits.
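Continuing the same hypothetical table from the local secondary index sketch above, a global secondary index declares its own key schema, its own projection, and its own provisioned throughput. A fragment-style sketch (attribute and index names again invented):

// Sketch: a GSI with a different hash key (zipCode) and its own capacity units.
// zipCode must also appear in the table's attribute definitions.
GlobalSecondaryIndex zipIndex = new GlobalSecondaryIndex()
    .withIndexName("zip-index")
    .withKeySchema(
        new KeySchemaElement("zipCode", KeyType.HASH),       // different from the table hash key
        new KeySchemaElement("orderDate", KeyType.RANGE))    // optional range key
    .withProjection(new Projection()
        .withProjectionType(ProjectionType.INCLUDE)
        .withNonKeyAttributes("totalAmount"))                // only projected attributes come back from queries
    .withProvisionedThroughput(new ProvisionedThroughput(10L, 2L)); // separate from the table's throughput

// Attached at table creation time: request.withGlobalSecondaryIndexes(zipIndex)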


Data types
Scalar data types (string, number, and binary)
Multivalued data types (string set, number set, and binary set)


Items in DynamoDB are simply collections of attributes. Attributes can be in the form of strings, numbers, binaries, or a set of scalar attributes. Each attribute consists of a name and a value. An item must have a primary key. A primary key can have a hash key or a combination of hash and range keys. In addition to the primary key, items can have any number of attributes except for the fact that item size cannot exceed 64 KB.
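For illustration, here is how one item might be put with the low-level Java API (the table and attribute names are hypothetical); the item is just a map of attribute names to typed values, and only the primary key attributes are mandatory:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;
import java.util.HashMap;
import java.util.Map;

public class PutPersonItem {
    public static void main(String[] args) {
        Map<String, AttributeValue> item = new HashMap<>();
        item.put("personId", new AttributeValue().withS("p-1001"));               // hash key (required)
        item.put("name", new AttributeValue().withS("Alice"));                    // scalar string attribute
        item.put("age", new AttributeValue().withN("31"));                        // numbers are sent as strings
        item.put("phones", new AttributeValue().withSS("555-0100", "555-0101"));  // multivalued string set

        new AmazonDynamoDBClient().putItem(new PutItemRequest("Person", item));
    }
}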

Local secondary index versus global secondary index:
  • Hash and range keys — Local: needs both hash and range keys; the index hash key is the same as the table hash key. Global: needs a hash key and an optional range key; index hash and range keys are different from the table's keys.
  • Query scope — Local: limited to data in a single partition. Global: queries over the complete table data, so you can also query on hash keys that are not the table hash key.
  • Consistency — Local: provides the option to select either eventual or strong consistency. Global: supports only eventual consistency.
  • Size — Local: the total size of all items for a single hash key must be less than 10 GB. Global: no size limit.
  • Provisioned throughput — Local: uses the same provisioned throughput as specified for the table. Global: has its own provisioned throughput; you must specify read and write capacity units at index creation.
Strong versus eventual consistency

Conditional writes
AWS supports atomic counters, which can be used to increment or decrement a value as needed. This is a special feature that handles increment and decrement requests in the order they are received. To implement atomic counters, you can use the ADD action of the UpdateItem API. A good use case for atomic counters is a website visitor count.
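A sketch of such a counter using the ADD action of UpdateItem (the table, key, and attribute names are illustrative):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeAction;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.AttributeValueUpdate;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;
import java.util.Collections;

public class VisitCounter {
    public static void main(String[] args) {
        // ADD atomically increments the numeric attribute "visits" by 1 for the given page.
        UpdateItemRequest increment = new UpdateItemRequest()
            .withTableName("PageStats")
            .withKey(Collections.singletonMap("pageId", new AttributeValue().withS("home")))
            .withAttributeUpdates(Collections.singletonMap("visits", new AttributeValueUpdate()
                .withValue(new AttributeValue().withN("1"))
                .withAction(AttributeAction.ADD)));

        new AmazonDynamoDBClient().updateItem(increment);
    }
}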

Item size
One good practice to reduce item size is to shorten attribute names. For example, instead of the attribute name yearOfPublishing, you could use the acronym yop.


Query versus scan

From a performance point of view, a query is much more efficient than a scan, as a query works on a limited set of items, while a scan churns through the entire table. The scan operation reads the entire table and only then applies filters to remove unwanted items from the result set. So, obviously, as your table data grows, the scan operation takes more and more time to return results.

The query operation's performance depends entirely on the amount of data retrieved; the number of keys matching a given search criterion decides the query performance. If the matching items for a specific hash key exceed the 1 MB response size limit, you can use pagination: the ExclusiveStartKey parameter allows you to continue your search from the last key retrieved by the earlier request. You need to submit a new query request for this.

Query results can be either eventually consistent or, optionally, strongly consistent, while scan results are eventually consistent only.

Pagination
DynamoDB provides us with two useful parameters, LastEvaluatedKey and ExclusiveStartKey, which allow us to fetch results in pages. If a query or scan result reaches the maximum size limit of 1 MB, you can issue the next request by setting ExclusiveStartKey to the LastEvaluatedKey from the previous response. When DynamoDB reaches the end of the search results, it returns LastEvaluatedKey as null.
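A pagination sketch in Java: keep re-issuing the request with ExclusiveStartKey set to the previous LastEvaluatedKey until that key comes back null (the table, key, and value below are illustrative):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ComparisonOperator;
import com.amazonaws.services.dynamodbv2.model.Condition;
import com.amazonaws.services.dynamodbv2.model.QueryRequest;
import com.amazonaws.services.dynamodbv2.model.QueryResult;
import java.util.Collections;
import java.util.Map;

public class PagedQuery {
    public static void main(String[] args) {
        AmazonDynamoDBClient client = new AmazonDynamoDBClient();
        Map<String, AttributeValue> startKey = null; // null on the first request
        do {
            QueryRequest page = new QueryRequest("OrderHistory")
                .withKeyConditions(Collections.singletonMap("customerId",
                    new Condition().withComparisonOperator(ComparisonOperator.EQ)
                        .withAttributeValueList(new AttributeValue().withS("c-42"))))
                .withExclusiveStartKey(startKey);
            QueryResult result = client.query(page);
            result.getItems().forEach(System.out::println);   // process this page of items
            startKey = result.getLastEvaluatedKey();          // null once all matching items are read
        } while (startKey != null);
    }
}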

Parallel scan
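DynamoDB's scan can be parallelized by splitting the table into segments and letting each worker scan one segment. A rough sketch (segment count and table name chosen arbitrarily, per-segment pagination omitted):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.ScanRequest;
import com.amazonaws.services.dynamodbv2.model.ScanResult;

public class ParallelScan {
    public static void main(String[] args) throws InterruptedException {
        AmazonDynamoDBClient client = new AmazonDynamoDBClient();
        int totalSegments = 4;                                   // one worker thread per segment
        Thread[] workers = new Thread[totalSegments];
        for (int i = 0; i < totalSegments; i++) {
            final int segment = i;
            workers[i] = new Thread(() -> {
                // Each worker reads only its own slice of the table's key space.
                ScanResult result = client.scan(new ScanRequest("OrderHistory")
                        .withTotalSegments(totalSegments)
                        .withSegment(segment));
                System.out.println("segment " + segment + " items: " + result.getCount());
            });
            workers[i].start();
        }
        for (Thread worker : workers) worker.join();
    }
}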

It uses a consistent hashing algorithm to achieve uniform data partitioning. Object versioning is done for consistency. The quorum technique is used to maintain consistency amongst the replicas.


DynamoDB does not have any master node or superior node that would control everything. Rather, it maintains a ring structure of nodes, where each node would have equal responsibilities to perform.

Design features

Data replication
Replicas are updated asynchronously by a background process.

Conflict resolution
It uses an "always writable" strategy, allowing writes at all times. This is a crucial strategy from Amazon's business point of view, as they don't want people to wait for a write to happen until a conflict is resolved.

When the database itself resolves conflicts, it prefers a last-write-wins strategy. In the case of Amazon, you are also given the choice of your own conflict resolution through features such as conditional writes.

Scalability
Symmetry
Flexibility

Load balancing

The coordinator node first keeps the key on the local machine and then replicates it to the N-1 successor machines in the clockwise direction.

Handling failures
Temporary failures  ==> hinted handoff

Permanent failures  ==> Merkle tree

Merkle tree
DynamoDB uses Merkle trees to maintain replica synchronization. Comparing all existing replicas and updating them with the latest changes is called anti-entropy.

A Merkle tree is a hash tree data structure used to store and compare objects efficiently.
In a Merkle tree, each parent node contains the hash of its children, so if the hash values of the root nodes of two trees are the same, the two trees are equal.

In the case of DynamoDB, a Merkle tree of the replica is built on each node and the trees are compared. If the root hashes are the same, the replicas are in sync; if they are not, the replicas are out of sync, and you can then compare the child nodes level by level to find the actual discrepancy.

Each DynamoDB node maintains a Merkle tree for each key range it holds. This allows DynamoDB to check whether certain key ranges are in sync or not.
If it finds any discrepancy, a child-wise traversal is done to locate the cause of the discrepancy.

This technique of replica synchronization is the same in Cassandra and Riak as well.
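A toy Merkle-root computation in Java makes the comparison idea concrete; this is a generic sketch, not DynamoDB's (or Cassandra's) actual implementation:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MerkleRoot {
    // Hash the leaves, then repeatedly hash pairs of child hashes until one root hash remains.
    static byte[] root(List<byte[]> leaves) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        List<byte[]> level = new ArrayList<>();
        for (byte[] leaf : leaves) level.add(sha.digest(leaf));
        while (level.size() > 1) {
            List<byte[]> parents = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                sha.update(level.get(i));
                if (i + 1 < level.size()) sha.update(level.get(i + 1)); // a lone last node is hashed alone
                parents.add(sha.digest()); // digest() also resets the MessageDigest for the next pair
            }
            level = parents;
        }
        return level.get(0);
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> keyRange = Arrays.asList(
            "item-1:v7".getBytes(StandardCharsets.UTF_8),
            "item-2:v3".getBytes(StandardCharsets.UTF_8),
            "item-3:v9".getBytes(StandardCharsets.UTF_8));
        // Two replicas are in sync iff their root hashes match; on a mismatch, descend to child nodes.
        System.out.println(Arrays.equals(root(keyRange), root(keyRange))); // true
    }
}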

Seed nodes
DynamoDB keeps seed nodes that hold static information about the cluster. Some nodes in the cluster play the role of seed nodes. Seed nodes have all the membership information, as that information is derived from an external service. All nodes ultimately reconcile their membership information with the seed nodes.

Each DynamoDB node consists of the following components:
Request coordinator

Membership and failure detection

Local persistent store (storage engine)
DynamoDB's local persistence store is a pluggable system where you can select the storage depending upon the application use. Generally, DynamoDB uses Berkeley DB or MySQL as the local persistence store. Berkeley DB is a high-performance embedded database library used for key-value pair storage. The storage engine is selected depending upon the application object size: Berkeley DB is used where the object size has a maximum limit of 10 KB, while MySQL is used where the application object size is expected to be larger.

To avoid creating hot partitions in DynamoDB, it is recommended to append a random number to the hash key.
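A hedged sketch of that suffix trick (sometimes called write sharding); the shard count and key format below are made up, and reads have to fan out across all suffixes and merge the results:

import java.util.concurrent.ThreadLocalRandom;

public class ShardedHashKey {
    static final int SHARDS = 10; // spread writes for one "hot" logical key across 10 physical hash keys

    // e.g. "2015-08-31" becomes "2015-08-31#0" .. "2015-08-31#9"; a read must query all 10 suffixes.
    static String shard(String logicalHashKey) {
        return logicalHashKey + "#" + ThreadLocalRandom.current().nextInt(SHARDS);
    }

    public static void main(String[] args) {
        System.out.println(shard("2015-08-31"));
    }
}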

Managing time series data
Choosing the order ID as the hash key and the date/time as the range key is not a good design.
It is recommended to create tables based on time range, which means creating a new table for each week or month instead of saving all data in one table. This strategy helps avoid the creation of hot or cold partitions. You can simply query the table for the particular time range. This strategy also helps when you need to purge data: you can simply drop the tables you don't wish to keep any more. Alternatively, you can dump that data to AWS S3 as flat files; S3 is a cheap data storage service from Amazon.

Storing large attribute values
Using compressions

Implementing one-to-many relationship
Table – UserCreds
Table – User

Indexing should be done only for tables that do not get heavy writes, as maintaining those indexes is quite costly.
If a required attribute is not projected in the index, then DynamoDB fetches that attribute from the table, which consumes more read capacity units.

Purging unnecessary items from the table: this also deletes the corresponding items from the index.
Updating items to remove the projected attributes from the index: this automatically reduces the size of the item collection.
Backup or moving old data to a new table: It is always a good practice to save historical data in a different table.

AWS supports Identity and Access Management (IAM) as a service.

There is a limit of five local and five global secondary indexes per table.

There is a maximum limit of 20 attributes on a local secondary index that is created by the user.
For the US East region, a table can scale up to 40,000 read or write capacity units, and for the other regions, DynamoDB tables can scale up to 10,000 read/write capacity units per table.

http://stackoverflow.com/questions/9752262/dynamodb-database-design-key-value-store-nosql
https://aws.amazon.com/blogs/aws/now-available-global-secondary-indexes-for-amazon-dynamodb/
Each table has a specified attribute called a hash key. An additional range key attribute can also be specified for the table. The hash key and optional range key attribute(s) define the primary index for the table, and each item is uniquely identified by its hash key and range key (if defined). Items contain an arbitrary number of attribute name-value pairs, constrained only by the maximum item size limit. In the absence of indexes, item lookups require the hash key of the primary index to be specified.

The Local and Global Index models extend the basic indexing functionality provided by DynamoDB. Let's consider some use cases for each model:
  • Local Secondary Indexes are always queried with respect to the table’s hash key, combined with the range key specified for that index.  In effect (as commenter Stuart Marshall made clear on the preannouncement post), Local Secondary Indexes provide alternate range keys. For example, you could have an Order History table with a hash key of customer id, a primary range key of order date, and a secondary index range key on order destination city. You can use a Local Secondary Index to find all orders delivered to a particular city using a simple query for a given customer id.
  • Global Secondary Indexes can be created with a hash key different from the primary index; a single Global Secondary Index hash key can contain items with different primary index hash keys. In the Order History table example, you can create a global index on zip code, so that you can find all orders delivered to a particular zip code across all customers. Global Secondary Indexes allow you to retrieve items based on any desired attribute.
Both Global and Local Secondary Indexes allow multiple items for the same secondary key value. 
Local Secondary Indexes support strongly consistent reads, allow projected and non-projected attributes to be retrieved via queries and share provisioned throughput capacity with the associated table. Local Secondary Indexes also have the additional constraint that the total size of data for a single hash key is currently limited to 10 gigabytes.
Global Secondary Indexes are eventually consistent, allow only projected attributes to be retrieved via queries, and have their own provisioned throughput specified separately from the associated table.
As I noted earlier, each Global Secondary Index has its own provisioned throughput capacity. By combining this feature with the ability to project selected attributes into an index, you can design your table and its indexes to support your application’s unique access patterns, while also tuning your costs. If your table is “wide” (lots of attributes) and an interesting and frequently used query requires a small subset of the attributes, consider projecting those attributes into a Global Secondary Index. This will allow the frequently accessed attributes to be fetched without expending read throughput on unnecessary attributes.

Apache Server Miscs



http://www.websiteoptimization.com/speed/tweak/cache/
http://metaskills.net/2006/02/19/how-to-control-browser-caching-with-apache-2/
<IfModule mod_expires.c>
  ExpiresActive On
  ExpiresDefault "access plus 1 seconds"
  ExpiresByType text/html "access plus 1 seconds"
  ExpiresByType image/gif "access plus 120 minutes"
  ExpiresByType image/jpeg "access plus 120 minutes"
  ExpiresByType image/png "access plus 120 minutes"
  ExpiresByType text/css "access plus 60 minutes"
  ExpiresByType text/javascript "access plus 60 minutes"
  ExpiresByType application/x-javascript "access plus 60 minutes"
  ExpiresByType text/xml "access plus 60 minutes"
</IfModule>
http://www.askapache.com/htaccess/apache-speed-cache-control/
This code uses the FilesMatch directive and the Header directive to add Cache-Control Headers to certain files.
# 480 weeks
<FilesMatch "\.(ico|pdf|flv|jpg|jpeg|png|gif|js|css|swf)$">
Header set Cache-Control "max-age=290304000, public"
</FilesMatch>

# 2 DAYS
<FilesMatch "\.(xml|txt)$">
Header set Cache-Control "max-age=172800, public, must-revalidate"
</FilesMatch>

# 2 HOURS
<FilesMatch "\.(html|htm)$">
Header set Cache-Control "max-age=7200, must-revalidate"
</FilesMatch>
If you are using far Future Expires Headers and Cache-Control (recommended), you can do this for these files.
<FilesMatch "\.(ico|pdf|flv|jpg|jpeg|png|gif|js|css|swf)$">
Header set Cache-Control "public"
Header set Expires "Thu, 15 Apr 2010 20:00:00 GMT"
</FilesMatch>


https://www.humboldt.co.uk/the-mystery-of-proxypassreverse/
The mod_proxy_ajp module for Apache has many advantages over mod_jk for connecting a Tomcat server to an Apache front. For me, the crucial advantage was the ProxyPassReverseCookiePath directive, which allows me to map the session cookies of a Tomcat web application (other than the root application) into the root of a virtual host.
Unfortunately, many tutorials contain misleading advice, and recommend this pattern for the ProxyPassReverse, which will break if the web application issues a redirect:
ProxyPass /jspdir ajp://localhost:8009/jspdir
ProxyPassReverse /jspdir ajp://localhost:8009/jspdir
The purpose of ProxyPassReverse is to rewrite the headers of HTTP redirect responses by a simple string substitution. Unfortunately, when the web application sends a redirect, it will send a redirect to an http: URL, not an ajp: URL. This will not match the argument of ProxyPassReverse, so the header will be passed through unchanged.
The working form looks like this, in a more complete example:
<VirtualHost *:80>
ServerName www.example.com
...
ProxyRequests Off
<Proxy *>
Order deny,allow
Allow from all
</Proxy>
ProxyPass / ajp://localhost:8009/jspdir/
ProxyPassReverse / http://www.example.com/jspdir/
ProxyPassReverseCookiePath /jspdir /
...
</VirtualHost>

https://httpd.apache.org/docs/current/mod/mod_proxy.html
ProxyPass
This directive allows remote servers to be mapped into the space of the local server. The local server does not act as a proxy in the conventional sense but appears to be a mirror of the remote server. The local server is often called a reverse proxy or gateway. The path is the name of a local virtual path; url is a partial URL for the remote server and cannot include a query string.
<Location "/mirror/foo/">
    ProxyPass "http://backend.example.com/"
</Location>
will cause a local request for http://example.com/mirror/foo/bar to be internally converted into a proxy request to http://backend.example.com/bar.
The ! directive is useful in situations where you don't want to reverse-proxy a subdirectory, e.g.
<Location "/mirror/foo/">
    ProxyPass "http://backend.example.com/"
</Location>
<Location "/mirror/foo/i">
    ProxyPass "!"
</Location>
ProxyPass "/mirror/foo/i" "!"
ProxyPass "/mirror/foo" "http://backend.example.com"
will proxy all requests to /mirror/foo to backend.example.com except requests made to /mirror/foo/i.
ProxyPassReverse
Adjusts the URL in HTTP response headers sent from a reverse proxied server
path is the name of a local virtual path; url is a partial URL for the remote server. These parameters are used the same way as for the ProxyPass directive.
For example, suppose the local server has address http://example.com/; then
ProxyPass         "/mirror/foo/" "http://backend.example.com/"
ProxyPassReverse  "/mirror/foo/" "http://backend.example.com/"
ProxyPassReverseCookieDomain  "backend.example.com"  "public.example.com"
ProxyPassReverseCookiePath  "/"  "/mirror/foo/"
will not only cause a local request for the http://example.com/mirror/foo/bar to be internally converted into a proxy request to http://backend.example.com/bar (the functionality which ProxyPass provides here). It also takes care of redirects which the server backend.example.com sends when redirecting http://backend.example.com/bar to http://backend.example.com/quux. Apache httpd adjusts this to http://example.com/mirror/foo/quux before forwarding the HTTP redirect response to the client. Note that the hostname used for constructing the URL is chosen in respect to the setting of the UseCanonicalName directive.

Useful in conjunction with ProxyPassReverse in situations where backend URL paths are mapped to public paths on the reverse proxy. This directive rewrites the path string in Set-Cookie headers. If the beginning of the cookie path matches internal-path, the cookie path will be replaced with public-path.
ProxyPassReverseCookiePath internal-path public-path
Adjusts the path in Set-Cookie HTTP headers on redirect responses when acting as a reverse proxy in order to avoid the client becoming aware that resources are mirrored.

ProxyPass / balancer://mycluster/rolling stickysession=JSESSIONID|jsessionid nofailover=Off
ProxyPassReverse /rolling /
ProxyPassReverseCookieDomain / rolling.com
ProxyPassReverseCookiePath /rolling /

A traditional HTTP proxy, also called a forward proxy, accepts requests from clients (usually web browsers), contacts the remote server, and returns the responses.

A reverse proxy is a web server that is placed in front of other servers, providing a unified frontend and acting as a gateway. As far as the web browsers are concerned, the reverse proxy is the “real” server, as that is the only one they interact with. The reverse proxy relays requests as necessary to the backend servers.

ProxyPass /crm http://crm.example.com/
ProxyPass /bugzilla http://backend.example.com/bugzilla
A reverse proxy can provide a unified frontend to a number of backend resources, associating certain URLs on the frontend machine to specific backend web servers.

ProxyPass path url
This command runs on an ordinary server and translates requests for a named directory and below to a demand to a proxy server. So, on our ordinary Butterthlies site, we might want to pass requests to /secrets onto a proxy server darkstar.com:

ProxyPass /secrets http://darkstar.com
ProxyPassReverse path url

A reverse proxy is a way to masquerade one server as another — perhaps because the "real" server is behind a firewall or because you want part of a web site to be served by a different machine but not to look that way. It can also be used to share loads between several servers — the frontend server simply accepts requests and forwards them to one of several backend servers.

Mac Apache httpd conf:
/etc/apache2/

/var/log/apache2
sudo apachectl restart

Hiding the Backend Servers
ProxyPass /crm http://crm.example.com
ProxyPassReverse /crm http://crm.example.com
ProxyErrorOverride On

Sometimes, however, the backend server will issue redirects or error pages that contain references to itself, for example in the Location: header.

The ProxyPassReverse directive will intercept these headers and rewrite them so that they include a reference to the reverse proxy (www.example.com) instead. The ProxyPassReverseCookiePath and ProxyPassReverseCookieDomain directives operate similarly, but on the path and domain strings in Set-Cookie: headers.

The Apache JServ Protocol (AJP) is a binary protocol that can proxy inbound requests from a web server through to an application server that sits behind the web server.


http://northernmost.org/blog/mod_log_forensic-howto/
mod_log_forensic howto
LoadModule log_forensic_module modules/mod_log_forensic.so
LoadModule unique_id_module modules/mod_unique_id.so
ForensicLog logs/forensic_log



http://feitianbenyue.iteye.com/blog/2056357

<Valve className="org.apache.catalina.valves.RemoteIpValve"
       remoteIpHeader="X-Forwarded-For"
       protocolHeader="X-Forwarded-Proto"
       protocolHeaderHttpsValue="https"/>



Enabling the Apache ProxyPreserveHost directive


The ProxyPreserveHost directive is used to instruct Apache mod_proxy, when acting as a reverse proxy, to preserve and retain the original Host: header from the client browser when constructing the proxied request to send to the target server.
The default setting for this configuration directive is Off, indicating to not preserve the Host: header and instead generate a Host: header based on the target server's hostname.
Because this is often not what is wanted, you should add the ProxyPreserveHost On directive to the Apache HTTPD configuration, either in httpd.conf or related/equivalent configuration files.
http://httpd.apache.org/docs/2.4/mod/mod_proxy.html#proxypassreverse
ProxyPass         "/mirror/foo/" "http://backend.example.com/"
ProxyPassReverse  "/mirror/foo/" "http://backend.example.com/"
ProxyPassReverseCookieDomain  "backend.example.com"  "public.example.com"
ProxyPassReverseCookiePath  "/"  "/mirror/foo/"
will not only cause a local request for http://example.com/mirror/foo/bar to be internally converted into a proxy request to http://backend.example.com/bar (the functionality which ProxyPass provides here). It also takes care of redirects which the server backend.example.com sends when redirecting http://backend.example.com/bar to http://backend.example.com/quux. Apache httpd adjusts this to http://example.com/mirror/foo/quux before forwarding the HTTP redirect response to the client. Note that the hostname used for constructing the URL is chosen in respect to the setting of the UseCanonicalName directive.
Note that this ProxyPassReverse directive can also be used in conjunction with the proxy feature (RewriteRule ... [P]) from mod_rewrite because it doesn't depend on a corresponding ProxyPass directive.
http://www.akadia.com/services/apache_redirect.html
Proxy Module
ProxyPass
The directive ProxyPass allows remote servers to be mapped into the space of the local server; the local server does not act as a proxy in the conventional sense, but appears to be a mirror of the remote server.
Suppose the local server has address http://wibble.org/; then
   ProxyPass /mirror/foo/ http://foo.com/
will cause a local request for the <http://wibble.org/mirror/foo/bar> to be internally converted into a proxy request to <http://foo.com/bar>.
ProxyPassReverse
The directive ProxyPassReverse lets Apache adjust the URL in the Location header on HTTP redirect responses. For instance, this is essential when Apache is used as a reverse proxy to avoid by-passing the reverse proxy because of HTTP redirects on the backend servers which stay behind the reverse proxy.
Suppose the local server has address http://wibble.org/; then
   ProxyPass /mirror/foo/ http://foo.com/
   ProxyPassReverse  /mirror/foo/ http://foo.com/
will not only cause a local request for <http://wibble.org/mirror/foo/bar> to be internally converted into a proxy request to <http://foo.com/bar> (the functionality ProxyPass provides here). It also takes care of redirects the server foo.com sends: when http://foo.com/bar is redirected by it to http://foo.com/quux, Apache adjusts this to http://wibble.org/mirror/foo/quux before forwarding the HTTP redirect response to the client.

Configure tomcat with httpd
mod_proxy_ajp
  1. Include two directives in your httpd.conf file for each web application that you wish to forward to Tomcat. For example, to forward an application at context path /myapp:
    ProxyPass         /myapp  http://localhost:8081/myapp
    ProxyPassReverse  /myapp  http://localhost:8081/myapp
    which tells Apache to forward URLs of the form http://localhost/myapp/* to the Tomcat connector listening on port 8081.
  2. Configure your copy of Tomcat to include a special <Connector> element, with appropriate proxy settings, for example:
    <Connector port="8081" ...
                  proxyName="www.mycompany.com"
                  proxyPort="80"/>
    which will cause servlets inside this web application to think that all proxied requests were directed to www.mycompany.com on port 80.

AJP
  • Well supported
  • A more compact protocol
  • Easily configured in Apache and Tomcat
AJP looks slightly ahead for two reasons:
  • You’re not opening up HTTP on Tomcat (in fact you should close it for security)
  • As it’s a more compact protocol, there’s less traffic between the front server and the back server
# within the VirtualHost section - assuming tomcat is on port 8080
ProxyPassReverse / http://localhost:8080/

ProxyPassReverse / ajp://localhost:8009/

The term Virtual Host refers to the practice of running more than one web site (such as company1.example.com and company2.example.com) on a single machine. Virtual hosts can be "IP-based", meaning that you have a different IP address for every web site, or "name-based", meaning that you have multiple names running on each IP address. The fact that they are running on the same physical server is not apparent to the end user.

IP-based virtual hosts use the IP address of the connection to determine the correct virtual host to serve. Therefore you need to have a separate IP address for each host. With name-based virtual hosting, the server relies on the client to report the hostname as part of the HTTP headers. Using this technique, many different hosts can share the same IP address.
Name-based virtual hosting is usually simpler, since you need only configure your DNS server to map each hostname to the correct IP address and then configure the Apache HTTP Server to recognize the different hostnames. Name-based virtual hosting also eases the demand for scarce IP addresses. Therefore you should use name-based virtual hosting unless you are using equipment that explicitly demands IP-based hosting.
http://www.jscape.com/blog/bid/87783/Forward-Proxy-vs-Reverse-Proxy
The Forward Proxy vs The Reverse Proxy
While a forward proxy proxies on behalf of clients (or requesting hosts), a reverse proxy proxies on behalf of servers. A reverse proxy accepts requests from external clients on behalf of servers stationed behind it.

To the client in our example, it is the reverse proxy that is providing file transfer services. The client is oblivious to the file transfer servers behind the proxy, which are actually providing those services. In effect, whereas a forward proxy hides the identities of clients, a reverse proxy hides the identities of servers.
Common uses for a reverse proxy server include:
Load balancing
Web acceleration – Reverse proxies can compress inbound and outbound data, as well as cache commonly requested content, both of which speed up the flow of traffic between clients and servers. They can also perform additional tasks such as SSL encryption to take load off of your web servers, thereby boosting their performance.
Security and anonymity
http://blogs.citrix.com/2010/10/04/reverse-vs-forward-proxy/

Reverse Proxies

A key component of reverse proxies is the ability to perform TCP multiplexing. This means that incoming connections are terminated, pooled, and new connections are established on the back end using a smaller number of server connections, resulting in a TCP multiplexing ratio. A typical TCP mux ratio is 10:1 – ten incoming connections to one back-end connection. Another benefit is that the back-end connections to the servers are kept open even when the incoming connections terminate, so that they can be re-used when new incoming connections come in – reducing the time needed to establish server connections and hence improving performance.
Reverse Proxies are good for:
  • Application Delivery including:
    • Load Balancing (TCP Multiplexing)
    • SSL Offload/Acceleration (SSL Multiplexing)
    • Caching
    • Compression
    • Content Switching/Redirection
    • Application Firewall
    • Server Obfuscation
    • Authentication
    • Single Sign On
Forward Proxies are good for:
  • Content Filtering
  • eMail security
  • NAT’ing
  • Compliance Reporting
Adjust tomcat logging
http://stackoverflow.com/questions/4119213/how-to-set-level-logging-to-debug-in-tomcat

https://tomcat.apache.org/tomcat-8.0-doc/config/valve.html
server.xml:
https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Access_Log_Valve
http://www.symantec.com/connect/blogs/enhancing-tomcat-logging-improved-forensics
http://www.techstacks.com/howto/configure-access-logging-in-tomcat.html
Combined Log Format: %{X-Forwarded-For}i %l %u %t %r %s %b %{User-Agent}i %{Referer}i


http://serverfault.com/questions/391457/how-does-apache-merge-multiple-matching-location-sections
<Location ~ "/solr">
Options FollowSymLinks
Order allow,deny
Allow from all

ProxyPass http://ip:port/solr
ProxyPassReverse http://localhost/solr
</Location>

Header append Vary "Accept-Encoding"
#or
Header set Vary "Accept-Encoding"
“Many HTTP caches decide that Vary: User-Agent is effectively Vary: * since the number of user-agents in the wild is so large. By asking to Vary on User-Agent you are asking your CDN to store many copies of your resource which is not very efficient for them, hence their turning off caching in this case.”

The HTTP Vary header is used by servers to indicate that the object being served will vary (the content will be different) based on some attribute of the incoming request, such as the requesting client's specified user-agent or language. The Akamai servers cannot cache different versions of the content based on the values of the Vary header. As a result, objects received with a Vary header that contains any value(s) other than "Accept-Encoding" will not be cached. To do so might result in some users receiving the incorrect version of the content (wrong language, etc.)
“Vary: User-Agent is broken for the Internet in general.  ...the basic problem is that the user-agents vary so wildly that they are almost unique for every individual (not quite that bad but IE made it a mess by including the version numbers of .Net that are installed on users machines as part of the string). If you Vary on User-Agent then intermediate caches will pretty much end up never caching resources (like Akamai).”
https://community.akamai.com/community/web-performance/blog/2015/09/16/is-caching-on-akamai-impacted-by-vary-header
Akamai will cache the content only if Vary header has value "Accept-Encoding".
Vary: Accept-Encoding
Any other values won't be cached. Examples:
Vary: Accept-Encoding,Referer
Vary: Accept-Encoding,User-Agent
Vary: User-Agent

http://my.globaldots.com/knowledgebase.php?action=displayarticle&id=32
An "Accept-Encoding" value to the Vary header is the exception to this rule only when it relates to serving the object compressed. Since compression does not change the content of the object (only its size for transfer), an object that varies only by its compression can be safely cached.

To summarize, Akamai servers will cache the object, based on the configuration settings, in either of the following cases:
  • If "Vary: Accept-Encoding" is received and the content is served compressed ("Content-Encoding: gzip").
  • If no Vary header at all is received.
http://orcaman.blogspot.com/2013/08/cdn-caching-problems-vary-user-agent.html
If you accidentally send the header and you are using a CDN, you can expect to experience the following nasty problems:
1. Your server load will increase dramatically, since requests will hit the origin instead of getting the content from the CDN.
2. Your users will experience slower response times because your origin will have to recompute the file's content on every request, and Geo-based content delivery will not be activated (the request will have to reach your origin server anyway).
3. Your CDN bill might explode (the CDN's usually charge by bandwidth, and they will be serving a lot more content if they are not able to cache it).

The solution for this problem can be:
1. If you don't really need the vary-by: user-agent header, don't send it! (.NET guys - if you use Microsoft BundleConfig to bundle and minify your CSS/JS, you will get this header automatically. You will need to get the System.Web.Optimization source code, remove the header from there, and recompile the DLL).
2. You can try to tell the CDN to ignore the vary-by: user agent header (in Akamai it is possible by configuration).


https://www.fastly.com/blog/best-practices-for-using-the-vary-header
So now there's an object in the cache that has a little flag on it that says "only to be used for requests that have no Accept-Encoding in the request."

Vary: *

Don't use this, period.
The HTTP RFC says that if a Vary header contains the special header name *, each request for said URL is supposed to be treated as a unique (and uncacheable) request.
This is much better indicated by using Cache-Control: private, which is clearer to anyone reading the response headers. It also signifies that the object shouldn't ever be stored, which is much more secure.

Vary: Cookie

Cookie is probably one of the most unique request headers, and is therefore very bad. Cookies often carry authentication details, in which case you're better off not trying to cache pages, but just passing them through. If you're interested in caching with tracking cookies, read more here.
However, sometimes cookies are used for A/B testing purposes, in which case it's a good idea to Vary on a custom header and leave the Cookie header intact. This avoids a lot of additional logic to make sure the Cookie header is left for URLs that need it (and are probably not cacheable).
