Fast Path Design
-------------------
This document describes the design of the data fastpath (FP) extension for
Linux. The fastpath ties into the connection tracking mechanism provided by
Linux Netfilter, to significantly improve network throughput performance. The
fastpath bypasses as much of the standard Linux packet processing path as
possible, to minimize instructions per packet. It does so by creating a "fast
path" that implements all packet mangling and routing at one point. When a
packet is received, a check is performed very early in the datapath to see if
the packet belongs to an existing connection for which a fastpath has been
established. If such a fastpath exists, all normal Linux processing of this
packet up to the egress queues is bypassed, with key packet mangling operations
copied from the fastpath connection information. If this fastpath
does not exist, the packet is processed normally by Linux.
The fastpath is designed to deal with the most commonly used traffic types,
namely TCP and UDP connections. Because these two traffic types comprise the
vast majority of traffic in a typical router, simply improving these two types
provides a significant performance improvement. Dealing with the many other
traffic types would greatly complicate this design, with little additional
performance improvement.
                                                            nfp_tracker
                                                                 |
                                                                 |
                                                                 |
RX     > netif_receive_skb > Netfilter > Route > Netfilter > Netfilter -+
driver         |             PREROUTE             FORWARD    POSTROUTE  |
               |              mangle              mangle      mangle    +-->
               |               nat                filter       nat           Egress
               |                                                        +-->
               |                                                        |
               +------------------- Bypass path ------------------------+
fastpath establishment (nfp_tracker)
----------------------------------------------------
The fastpath code is an extension of the conntrack mechanism built into Linux.
A fastpath is created by first allowing normal Linux processing to occur on a
new TCP or UDP connection. A tracker API called in the Netfilter POSTROUTING
stage creates a fastpath structure containing the necessary information to
mangle and deliver a packet. Mangling is required to support NAT and QoS. The
tracker API is registered to the POSTROUTING hookpoint for NF_BR_POST_ROUTING
and NF_IP_POST_ROUTING. The hook into NF_BR_POST_ROUTING allows the fastpath
to include bypassing of the bridging code, for traffic that is routed
to a bridge. When the tracker sees that the packet's output device is
a bridge device, it stops tracking, and waits to be called again from
the NF_BR_POST_ROUTING hook, where the output device will have been
set by the normal processing path to point to the true physical device.
The tracker (nfp_tracker()) performs a number of sanity checks on each received
packet to determine whether or not to create a fastpath for the connection.
The following scenarios will prevent the tracker from establishing a fastpath:
* Invalid CONNMARK (see Conntrack Marking below)
* Locally sourced (see Local Traffic below)
* TCP conntrack not "ESTABLISHED and assured"
* outgoing interface is a bridge interface (as described above)
If none of these apply, the tracker will create a fastpath structure for this
connection, and subsequent packets belonging to this connection may be handled
entirely by the fastpath code.
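The four conditions listed above can be read as a single predicate. The following is a minimal user-space sketch of that decision, not the actual nfp_tracker() code. The struct nfp_track_ctx, its field names, and the function nfp_may_create_fastpath() are hypothetical and exist only to model the listed checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of what the tracker knows about one packet. */
struct nfp_track_ctx {
    uint32_t ct_mark;                 /* CONNMARK of the packet's conntrack */
    uint32_t nfp_mark;                /* mask read from /proc/net/nfp_mark */
    bool     locally_sourced;         /* packet originated on this host */
    bool     is_tcp;                  /* true for TCP, false for UDP */
    bool     tcp_established_assured; /* only meaningful when is_tcp is set */
    bool     out_dev_is_bridge;       /* output device is a bridge device */
};

/* Returns true if none of the listed conditions blocks fastpath creation. */
static bool nfp_may_create_fastpath(const struct nfp_track_ctx *c)
{
    assert(c != NULL);

    if ((c->ct_mark & c->nfp_mark) == 0)
        return false;   /* invalid CONNMARK */
    if (c->locally_sourced)
        return false;   /* local traffic is never fastpath'd */
    if (c->is_tcp && !c->tcp_established_assured)
        return false;   /* TCP must be ESTABLISHED and assured */
    if (c->out_dev_is_bridge)
        return false;   /* wait for the bridge POST_ROUTING hook */
    return true;
}
```

In this model a UDP connection skips the ESTABLISHED/assured test, matching the list above, which names that condition for TCP only.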
Two structs are used to store the fastpath connection information:
    struct nfp_struct
    {
        int iif;
        struct net_device *output_dev;
        __be32 saddr;
        __be32 daddr;
        __u16 sport;
        __u16 dport;
        __u32 mark;
        __u32 priority;
        __u8 tos;
        struct dst_entry *dst;
        u32 nat;
        u32 csum_diff[4];
    };
    struct nfp_bidir_struct
    {
        struct nfp_struct nfp[IP_CT_DIR_MAX];
    #ifdef CONFIG_IP_NF_CT_ACCT
        struct nfp_stats_struct *stats;
    #endif
    };
struct nfp_struct contains all the information for one direction of the
connection; two instances are used to completely describe a fastpath. The
critical information stored in this structure is:
* output_dev: output device
* saddr, daddr, sport, dport: source and destination IP addresses and TCP/UDP
ports for packet mangling. Only TCP and UDP packets are handled.
* nat: flag indicating whether IP address and TCP/UDP port mangling is required
* mark, priority: sk_buff fields copied for QoS support
* csum_diff[]: array storing checksum data used to calculate a fast TCP/UDP checksum
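One common way to obtain a "fast" checksum after NAT mangling is the incremental update of RFC 1624: the difference caused by the rewritten words is computed once, and each packet's checksum is then patched instead of being recomputed. The sketch below illustrates that technique in user space. It is an assumption about the kind of data csum_diff[] might hold, not a description of the actual fastpath code, and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement accumulator down to 16 bits. */
static uint16_t csum_fold32(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Per-connection delta for replacing one 16-bit word (RFC 1624: ~m + m'). */
static uint32_t csum_delta16(uint16_t old_w, uint16_t new_w)
{
    return (uint32_t)(uint16_t)~old_w + new_w;
}

/* Per-packet step: HC' = ~(~HC + delta). */
static uint16_t csum_apply(uint16_t check, uint32_t delta)
{
    return (uint16_t)~csum_fold32((uint32_t)(uint16_t)~check + delta);
}

/* Reference implementation: full checksum over n 16-bit words. */
static uint16_t csum_full(const uint16_t *w, size_t n)
{
    uint32_t sum = 0;

    assert(w != NULL);
    for (size_t i = 0; i < n; i++)
        sum += w[i];
    return (uint16_t)~csum_fold32(sum);
}
```

The deltas for each rewritten address or port word can be summed when the fastpath is created, so the per-packet cost is a single csum_apply() call regardless of payload size.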
These structs are attached to the conntrack structure via an element (misc)
added to struct ip_conntrack:
    struct ip_conntrack
    {
        /* Usage count in here is 1 for hash table/destruct timer, 1 per skb,
           plus 1 for any connection(s) we are `master' for */
        struct nf_conntrack ct_general;
        /* General-purpose pointer */
        void *misc;
        ...
The allocated struct nfp_bidir_struct is assigned to misc when the fastpath is
created.
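As an illustration of how a per-connection structure can hang off a general-purpose pointer, here is a small user-space model. All demo_* names are hypothetical stand-ins. Only the misc pointer and the two-direction array are taken from the text, and the kernel would use its own allocator rather than calloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Mirrors the usual conntrack direction indices (original = 0, reply = 1). */
enum { DEMO_DIR_ORIGINAL = 0, DEMO_DIR_REPLY = 1, DEMO_DIR_MAX = 2 };

struct demo_nfp       { uint32_t saddr, daddr; uint16_t sport, dport; };
struct demo_nfp_bidir { struct demo_nfp nfp[DEMO_DIR_MAX]; };
struct demo_conntrack { void *misc; };  /* only the added pointer is modelled */

/* Attach a zeroed bidirectional record; returns 0 on success, -1 on failure. */
static int demo_fastpath_attach(struct demo_conntrack *ct)
{
    assert(ct != NULL);
    if (ct->misc)
        return -1;                      /* a fastpath is already attached */
    ct->misc = calloc(1, sizeof(struct demo_nfp_bidir));
    return ct->misc ? 0 : -1;
}

/* Look up the record for one direction, or NULL if no fastpath exists. */
static struct demo_nfp *demo_fastpath_get(struct demo_conntrack *ct, int dir)
{
    struct demo_nfp_bidir *b;

    assert(ct != NULL);
    b = ct->misc;
    if (!b || dir < 0 || dir >= DEMO_DIR_MAX)
        return NULL;
    return &b->nfp[dir];
}

/* Destroy the fastpath record, e.g. when the conntrack goes away. */
static void demo_fastpath_detach(struct demo_conntrack *ct)
{
    assert(ct != NULL);
    free(ct->misc);
    ct->misc = NULL;
}
```

A NULL misc pointer doubles as the "no fastpath established" indicator, so the per-packet lookup needs no separate flag.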
Bypass path
------------
The goal of the FP is to bypass as much of the normal processing path as
possible. To this end, the first non-driver codepoint in the network path was
chosen as the entry point into the fastpath code. netif_receive_skb() was
modified as follows:
    rcu_read_lock();
    netif_fn = rcu_dereference(netif_receive_fastpath);
    if (netif_fn)
    {
        if (netif_fn(skb))
        {
            rcu_read_unlock();
            return __netif_receive_skb(skb);
        }
        rcu_read_unlock();
        return 0;
    }
    rcu_read_unlock();
    return __netif_receive_skb(skb);
netif_receive_fastpath is a function registered using
netdev_register_netif_receive_fastpath()
and unregistered with
netdev_unregister_netif_receive_fastpath()
The original netif_receive_skb() function is renamed to __netif_receive_skb();
this is called if the registered fastpath function returns non-zero,
indicating it was unable to deliver the packet.
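The return convention described above (zero means the handler delivered the packet, non-zero means fall back to the normal path) can be modelled in user space as follows. This is only a sketch: RCU protection is deliberately left out, and struct pkt and every function name here are hypothetical stand-ins for the kernel objects.

```c
#include <assert.h>
#include <stddef.h>

struct pkt { int id; };                 /* stand-in for struct sk_buff */
typedef int (*fastpath_fn)(struct pkt *);

static fastpath_fn fastpath_hook;       /* NULL when nothing is registered */
static int slow_path_calls;             /* counts fallbacks, for illustration */

static void register_fastpath(fastpath_fn fn) { fastpath_hook = fn; }
static void unregister_fastpath(void)         { fastpath_hook = NULL; }

/* Stand-in for the renamed original receive function. */
static int slow_path_receive(struct pkt *p)
{
    (void)p;
    slow_path_calls++;
    return 0;
}

/* Entry point: try the registered handler first, then fall back. */
static int receive(struct pkt *p)
{
    fastpath_fn fn = fastpath_hook;

    assert(p != NULL);
    if (fn && fn(p) == 0)
        return 0;                       /* delivered by the fastpath */
    return slow_path_receive(p);
}

/* Example handler: accepts even ids, declines odd ones. */
static int demo_handler(struct pkt *p)
{
    return (p->id % 2) ? 1 : 0;
}
```

With no handler registered, receive() costs one pointer test more than the unmodified path.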
The fastpath code registers nfp_packet() as the fastpath handler function.
This function performs some basic checks to determine whether or not a packet
should be directed through the fastpath, or returned to be handled by standard
Linux processing. Any one of the following conditions will cause a packet to be
returned to normal Linux processing:
* Packet type is neither IP nor VLAN
* NULL input interface (indicates locally sourced packet)
* Shared skb
* Invalid IPv4 checksum
* Invalid packet length
* IP fragmented packet
* IP Options in packet
* Non TCP/UDP
* No conntrack exists yet
* Non-matching CONNMARK
* TCP packet contains SYN, RST, or FIN
* TCP conntrack check fails (tcp_in_window())
* TTL <= 1
If none of the above conditions applies, the packet is handled by the
fastpath. This handling involves assigning the TOS, IP source address, IP
destination address, TCP/UDP source and destination ports, and skb priority
and mark, all based on the values saved in the fastpath structure for the
connection. The TTL of the packet is decremented, checksums are recalculated,
and ip_finish_output() is called to deliver the packet to the egress queues.
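The header-level subset of these checks, and the TTL decrement with its checksum fix-up, can be sketched on a raw 20-byte IPv4 header. This is an illustrative user-space model with hypothetical function names, not the nfp_packet() implementation. The checksum patch follows the same end-around-carry idea as the kernel's ip_decrease_ttl().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of the header as big-endian 16-bit words, folded. */
static uint16_t ip_hdr_sum(const uint8_t *h, size_t len)
{
    uint32_t sum = 0;

    assert(h != NULL);
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)(h[i] << 8 | h[i + 1]);
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)sum;
}

/* Header-only checks: no options, valid checksum, unfragmented, TCP/UDP, TTL. */
static bool ip_hdr_fastpath_ok(const uint8_t *h, size_t len)
{
    if (len < 20 || h[0] != 0x45)            /* IPv4 with IHL 5: no IP options */
        return false;
    if (ip_hdr_sum(h, 20) != 0xffff)         /* invalid IPv4 header checksum */
        return false;
    if (((h[6] << 8 | h[7]) & 0x3fff) != 0)  /* MF set or fragment offset != 0 */
        return false;
    if (h[9] != 6 && h[9] != 17)             /* neither TCP nor UDP */
        return false;
    return h[8] > 1;                         /* TTL <= 1 goes to the normal path */
}

/* Decrement TTL (byte 8) and patch the checksum (bytes 10-11) incrementally. */
static void ip_dec_ttl(uint8_t *h)
{
    uint32_t check = (uint32_t)(h[10] << 8 | h[11]);

    check += 0x0100;                 /* TTL is the high byte of its 16-bit word */
    check += (check >= 0xffff);      /* end-around carry */
    h[10] = (uint8_t)(check >> 8);
    h[11] = (uint8_t)check;
    h[8]--;
}
```

Because only one header word changes, the checksum is adjusted with two additions rather than a pass over the whole header.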
Note that the fastpath handler is called for *every* packet that is received by
netif_receive_skb(). This results in a small performance decrease for all
packets that are not handled by the fastpath. This is not considered
significant, as the intent is to use the fastpath to handle the vast majority
of traffic.
VLAN traffic
________________
VLAN traffic is normally handled by Linux by calling the registered
handler for the VLAN packet type (in netif_receive_skb()). The
default handler (vlan_skb_recv() in net/8021q/vlan_dev.c) strips the
vlan header and requeues the packet into the backlog queue of the
running CPU. This is quite inefficient, especially for QinQ (stacked
VLAN tags). This "stripping" function usually involves more than
simply a pointer manipulation by skb_pull(); for most configurations
the source and destination MAC addresses are moved in memory to
overwrite the vlan tag from the packet. This is done to support some
layer 2 functions (bridging, DHCP, etc.), but is not at all necessary
for traffic handled by the fastpath.
The fastpath code includes its own vlan reception code based on
vlan_skb_recv(), and calls this in a loop to remove all vlan tags
without the need to requeue and dequeue the packet. Some minimal error
checks are done in this code (similar to those in
vlan_skb_recv()), and any failed checks cause the packet to be
returned to normal Linux processing. If the packet is to be returned,
it is returned unmodified. The memory move described above is also
avoided for an additional performance gain.
Once all the VLAN tags have been parsed and stripped, it is still
possible for the packet not to be handled by the fastpath because of IP
errors (see conditions described above). In this case, the packet is
restored to its original state (i.e. all vlan tags are restored, and
vlan dev counters are restored), and then returned to normal Linux
processing.
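The idea of walking stacked tags in a loop, without moving any bytes and without committing to the fastpath until the inner protocol is known, can be sketched as below. The function and macro names are hypothetical, and only the 0x8100 tag type is recognised. The real code also maintains vlan device counters, which this model omits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_HLEN    14      /* dst MAC + src MAC + EtherType */
#define DEMO_VLAN_HLEN    4      /* TCI + encapsulated EtherType */
#define DEMO_ETH_P_8021Q 0x8100
#define DEMO_ETH_P_IP    0x0800

/*
 * Returns the offset of the IPv4 header, or -1 if the frame should be left
 * untouched for normal processing. *depth receives the number of tags seen.
 * The frame itself is never modified.
 */
static int demo_vlan_skip_tags(const uint8_t *frame, size_t len, int *depth)
{
    size_t off = DEMO_ETH_HLEN;
    uint16_t proto;

    assert(frame != NULL && depth != NULL);
    *depth = 0;
    if (len < DEMO_ETH_HLEN)
        return -1;
    proto = (uint16_t)(frame[12] << 8 | frame[13]);
    while (proto == DEMO_ETH_P_8021Q) {
        if (len < off + DEMO_VLAN_HLEN)
            return -1;           /* truncated tag: hand back unmodified */
        proto = (uint16_t)(frame[off + 2] << 8 | frame[off + 3]);
        off += DEMO_VLAN_HLEN;
        (*depth)++;
    }
    return proto == DEMO_ETH_P_IP ? (int)off : -1;
}
```

Because the walk only computes an offset, "restoring" the packet after a later IP-level failure amounts to discarding that offset.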
Locally sourced and terminated traffic
--------------------------------------
The fastpath is only intended to handle forwarded traffic. All traffic to/from
the host is handled normally by Linux. Such local traffic typically comprises
only a small amount of traffic in a router, and so little performance gain would
be realized by creating a fastpath for this traffic.
Bridged traffic
---------------
Traffic which is handled entirely by the bridge (i.e. *not* routed) will never
traverse through the fastpath (because conntrack will never see this traffic).
Throughput of bridged traffic is significantly higher than for routed traffic in
the standard Linux path, and any optimization of this path is outside the scope
of this work.
CONNMARK
-----------
The fastpath makes use of the CONNMARK capability of Netfilter to determine
whether or not packets are to be fastpath'd. Only packets whose conntrack
CONNMARK, AND'd with the mark value configured for the fastpath (via
/proc/net/nfp_mark, default is 65536), is non-zero will be considered for
fastpath. This
mechanism allows iptables rules to be used to easily control which traffic
should be handled via the fastpath. The simplest case is to mark all TCP and
UDP packets with the appropriate CONNMARK so all such connections are handled by
the fastpath:
iptables -t mangle -A FORWARD -p TCP -j CONNMARK --set-mark 65536/65536
iptables -t mangle -A FORWARD -p UDP -j CONNMARK --set-mark 65536/65536
This simple case has a possible disadvantage in that it allows a fastpath for
unidirectional UDP traffic. If this is not desired, iptables rules can prevent
it. Here we allow unidirectional fastpaths only on ports 1024 thru 1034:
iptables -t mangle -A FORWARD -p TCP -j CONNMARK --set-mark 65536/65536
iptables -t mangle -A FORWARD -p UDP -m state --state ESTABLISHED -j CONNMARK --set-mark 65536/65536
iptables -t mangle -A FORWARD -p UDP --dport 1024:1034 -j CONNMARK --set-mark 65536/65536
CONNMARK is frequently used in firewalls to allow RELATED connections to inherit
a MARK value from a connection. For example, a data connection in an ALG such
as FTP can inherit the MARK from the primary connection, and this mark can be
used for QoS control. For ALGs whose related ports are arbitrary, there is no
other convenient way to mark such related streams. By using a bitmask for the
nfp_mark value in the fastpath code, the traditional use of CONNMARK can be
preserved. For example, with the default value of 65536, only bit 16 of
CONNMARK is needed for fastpath, so bits 0-15 and 17-31 can be used for other
purposes. All known uses of CONNMARK require only a few unique values to be
stored. This scheme requires that the CONNMARK rules in iptables make use of
the mask capabilities in this rule. The example iptables configuration at the
end of this document shows how the CONNMARK is shared for fastpath and other
uses.
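The bit-sharing scheme can be checked with a few lines of arithmetic. The sketch below is a simplified model of a masked "--set-mark value/mask" write and of the fastpath mark test. The function names are hypothetical, and the model assumes the value lies within the mask, as it does in the rules shown in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_NFP_MARK 0x10000u   /* 65536, i.e. bit 16: the documented default */

/* Model of a masked CONNMARK write: only the bits in mask are rewritten. */
static uint32_t demo_connmark_set(uint32_t ct_mark, uint32_t value, uint32_t mask)
{
    assert((value & ~mask) == 0);        /* value must lie within the mask */
    return (ct_mark & ~mask) | value;
}

/* A connection is a fastpath candidate if any fastpath bit is set. */
static bool demo_nfp_mark_matches(uint32_t ct_mark, uint32_t nfp_mark)
{
    return (ct_mark & nfp_mark) != 0;
}
```

For example, a connection carrying QoS mark 1 becomes 0x10001 once the fastpath bit is set, and a later write of QoS mark 7 under the inverted mask 0xFFFEFFFF yields 0x10007, leaving the fastpath bit untouched.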
ALG considerations
------------------
To allow ALGs to function correctly, it is usually necessary to prevent traffic
that needs to be seen by an ALG from being handled by the fastpath. This
requires additional iptables rules. For example, to prevent packets destined to
port 21 (FTP) from being handled via the fastpath, the iptables rules should
change to:
iptables -t mangle -A FORWARD -p TCP --dport ! 21 -j CONNMARK --set-mark 65536/65536
iptables -t mangle -A FORWARD -p UDP -j CONNMARK --set-mark 65536/65536
Because port 21 destined packets are not fastpath'd, they will always be
handled by the normal Linux path, and the ftp alg will thus operate correctly.
Note it is *not* necessary to prevent fastpath of the connection that is RELATED
to the ftp connection. This is so because the ALG helper function for FTP does
not run on packets in the RELATED connection, but only on those in the primary
connection.
ALGs which need to see *all* traffic, including any RELATED streams, will need
further iptables rules to prevent fastpath handling of such traffic.
Network event handling
------------------------
Numerous network events occur that cause changes in packet forwarding. Because
the fastpath bypasses much of the networking code, the normal mechanisms by
which the effects of these events propagate are not effective. For example,
consider routing table changes that may occur while fastpaths are already
established. Because the fastpath stores the output interface, and does not do
routing lookups for each packet, the routing change will not affect the
fastpath-handled packets, and they will thus be delivered to the wrong
interface.
The solution to this is to register notifiers to existing notification chains,
and create new notification chains where needed. The following network events
are considered:
* Interface down : A callback is registered via register_netdevice_notifier()
to handle interface DOWN events. The handler removes all fastpath
structures which include the device.
* IP Address change: register_inetaddr_notifier() is called to register a
handler that destroys all fastpath structures that include the device
whose address has changed.
* ip_conntrack notifications: A handler is registered on the existing
ip_conntrack_chain, to handle changes to conntracks (expiring connections,
interface down, etc.). The handler handles only the conntrack IPCT_DESTROY
event and destroys the associated fastpath structures for that conntrack.
(Note: as of kernel version 2.6.20, the ip_conntrack notification is still
deemed experimental.)
* Routing changes: A new notification chain is added to net/ipv4/route.c.
This chain is called whenever the routing table is flushed. This provides
notification for all routing changes. The fastpath handler destroys all
fastpaths in this case, because it cannot know whether the routes inherent
in the fastpath structures are still valid. This does *not* cause packet
loss, because the packets will simply traverse the normal linux processing
path (causing the fastpath for that connection to be re-established).
* Netfilter/Iptables change events: Because the fastpath bypasses all
Netfilter processing, changes to netfilter via iptables commands will not
have any effect on established connections. This is obviously highly
undesirable, as a user could not introduce new iptables rules to block a
current DoS attack, for example. A notification chain has been added to
net/ipv4/netfilter/ip_tables.c, and is traversed whenever __do_replace() is
called to install a new iptables ruleset. The fastpath handler for this
notification destroys all fastpath structures, identical to the routing
handler described above.
* Bridge configuration changes: A new notification chain is added to
net/bridge/br_ioctl.c, which is traversed whenever an interface is added to or
removed from a bridge. The fastpath handler in this chain destroys all
fastpath structures which include the added/removed device.
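The handlers above fall into two groups: those that destroy only the fastpaths touching one device (interface down, address change, bridge membership change) and those that destroy everything (route flush, iptables replace). A minimal user-space sketch of that flush logic follows. The fixed-size table and all demo_* names are hypothetical and stand in for whatever bookkeeping the real code uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_TABLE_SIZE 8
#define DEMO_FLUSH_ALL  (-1)     /* pseudo-ifindex meaning "every fastpath" */

struct demo_fastpath {
    bool in_use;
    int  in_ifindex;             /* input interface of the connection */
    int  out_ifindex;            /* stored output device */
};

static struct demo_fastpath demo_table[DEMO_TABLE_SIZE];

/* Record a new fastpath; returns false if the table is full. */
static bool demo_fastpath_add(int in_ifindex, int out_ifindex)
{
    for (size_t i = 0; i < DEMO_TABLE_SIZE; i++) {
        if (!demo_table[i].in_use) {
            demo_table[i] = (struct demo_fastpath){ true, in_ifindex, out_ifindex };
            return true;
        }
    }
    return false;
}

/* Destroy fastpaths that include the device (or all); returns the count. */
static int demo_fastpath_flush(int ifindex)
{
    int destroyed = 0;

    assert(ifindex >= DEMO_FLUSH_ALL);
    for (size_t i = 0; i < DEMO_TABLE_SIZE; i++) {
        struct demo_fastpath *fp = &demo_table[i];

        if (!fp->in_use)
            continue;
        if (ifindex == DEMO_FLUSH_ALL ||
            fp->in_ifindex == ifindex || fp->out_ifindex == ifindex) {
            fp->in_use = false;
            destroyed++;
        }
    }
    return destroyed;
}
```

Destroying a fastpath is always safe in this design: the next packet of the connection takes the normal path and the tracker may establish a fresh fastpath.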
IPTables Integration: Targets
-----------------------------
IPTables provides multiple hook points and a wide (and constantly growing) array
of filtering/matching functions, some of which integrate easily, and some that
are incompatible with a connection based data fastpath.
The following description of the IPTables targets and match criteria pertains
only to traffic that is NOT handled by ALGs. ALG-handled traffic is discussed
elsewhere in this document.
1. Pre-ESTABLISHED targets
The following targets: LOG, ULOG, ACCEPT, REJECT, DROP, NOTRACK, RETURN,
REDIRECT are supported and function normally only until a TCP/UDP connection is
ESTABLISHED and assured. Once a connection is ESTABLISHED and assured, the data
fastpath will be used, so any normal targets will not be reached.
Note that some of these targets (NFQUEUE, QUEUE, REDIRECT, REJECT, DROP,
NOTRACK) will prevent establishment of a conntrack connection, and so will work
normally as on unmodified Linux.
2. Packet / sk_buff modifier targets
Includes these targets: CLASSIFY, MARK, TOS, DSCP, SAME, SNAT, DNAT, TTL,
MASQUERADE, NETMAP. These targets are supported by the fastpath code on a
per-connection basis. This means that during connection establishment, the
effect of these modifier targets is recorded by the fastpath tracker, and these
effects are replicated on all subsequent packets that are handled by the fastpath.
3. Unsupported targets
If packet modification based on non-connection characteristics of packets is
required (such as packet length, for example), the data fastpath must not be
used for that connection.
4. Summary of IPTables targets
Netfilter Target  Modifies                          Behaviour with fastpath
----------------  --------------------------------  -------------------------
CLASSIFY          skb->priority                     Supported
CONNMARK          ct->mark                          Unsupported
MARK              skb->nfmark                       Supported
NFQUEUE (old)     queue packets to userland         Supported pre-ESTABLISHED
NOTRACK           prevents conntrack on packets     Supported pre-ESTABLISHED
SECMARK           skb->secmark                      Unsupported
CONNSECMARK       ct->secmark                       Unsupported
REJECT            Drop packet                       Supported pre-ESTABLISHED
REDIRECT          dest IP address to local address  Supported pre-ESTABLISHED
RETURN            chain traversal                   Supported pre-ESTABLISHED
SAME              Source NAT                        Supported
SNAT              Source NAT                        Supported
TCPMSS            TCP MSS alteration                Supported (see Note 1)
TOS               set TOS in IP header              Supported
TTL               set TTL in IP header              Unsupported
ULOG              logging                           Supported pre-ESTABLISHED
ACCEPT                                              Supported pre-ESTABLISHED
CLUSTERIP         For server clusters               Unsupported
DNAT              Destination NAT                   Supported
DROP                                                Supported pre-ESTABLISHED
DSCP              Set DSCP in IP header             Supported
ECN                                                 TBD
LOG               logging                           Supported pre-ESTABLISHED
MASQUERADE        Source NAT                        Supported
NETMAP            Source/Dest NAT                   Supported
Note 1: TCPMSS should only be used for SYN packets, so it is not affected by
the data fastpath.
IPTables Integration : Matches
-------------------------------
All IPTables matches can be used during the establishment of a connection.
There are two types of matches relevant to fastpath, connection specific and
packet specific. Connection specific matches are those matches that are
invariant across all packets belonging to a specific conntrack connection, such
as:
* IP source/destination address
* TCP/UDP port
* IP Protocol (TCP/UDP)
Iptables rules that match on these parameters are not traversed by packets
belonging to an established fastpath connection, but because these parameters
*define* a connection, any change to these parameters will result in a new
connection, and hence such changed packets will traverse the ruleset.
All other match criteria can only be expected to match during the
pre-ESTABLISHED phase of a connection. Once the connection is established,
subsequent packets belonging to that connection will not traverse the IPTables
rules. For virtually all traffic, this behaviour is quite acceptable, as many
of the commonly used match criteria will match on fields that do not normally
change. For example, a match on a DSCP value will only be used during
connection establishment. Although it is theoretically possible for subsequent
packets belonging to that connection to have a different DSCP value from that used
during pre-establishment, in practice this should not occur.
Examples of matches that do not work cleanly with the fastpath are:
* length: A rule could be devised to drop all VOIP packets of unexpected
length.
* limit/hashlimit: limit and hashlimit rules match traffic rates, so to
work correctly, they would need to see all applicable packets.
* mac match: Because conntrack does not monitor Layer 2 information, it is
possible to have packets come in with different mac addresses yet match the
same conntrack. This is likely a non-issue, as standard Linux firewalls based
on conntrack would also suffer from the same problem.
An exhaustive analysis of every possible netfilter matching criterion is beyond
the scope of this document. Rather, a simple guideline holds:
Traffic that is intended to be handled by the fastpath should only be
subjected to Netfilter matches that apply to all packets in a conntrack
connection.
The solution to cases where unsupported Netfilter matches are required is to not
allow a fastpath to be established for that conntrack connection, by not
writing the CONNMARK target for that connection. For example, the following
rules would prevent traffic to/from default bittorrent ports from traversing
through the fastpath, and limit the bittorrent rate to 1000 pkts/sec:
iptables -t mangle -A FORWARD -j CONNMARK --set-mark 1234
iptables -t mangle -A FORWARD -p TCP --dport 6881:6889 -j CONNMARK --set-mark 0
iptables -A PREROUTING -p TCP --dport 6881:6889 -m limit --limit 1000/second \
--limit-burst 1000 -j ACCEPT
Fastpath monitoring and statistics
-----------------------------------
When fastpath is enabled, the conntrack table available from
/proc/net/ip_conntrack is extended to include a flag indicating whether or not a
connection is handled by the fastpath. For example, here are two conntrack
entries, the first handled by normal processing, the second by the fastpath:
udp 17 27 src=192.168.0.3 dst=192.168.2.2 sport=1025 dport=1025 ...
udp 17 177 [FAST_PATH] src=192.168.2.2 dst=192.168.0.3 sport=1025 ...
Two Linux kernel compilation flags control enabling of fastpath statistics.
CONFIG_PMC_SM_STATS enables overall fastpath statistics, which can be viewed by
reading /proc/net/pmc_sm_stats. Writing to this /proc entry will clear these
statistics. The statistics included are:
Counter                       Description
----------------------------  ------------------------------------------------
Active TCP FAST_PATH          Current number of TCP connections that will be
connections                   handled by the fastpath
Active UDP FAST_PATH          Current number of UDP "connections" that will be
connections                   handled by the fastpath
FAST_PATH Packets handled     Total number of packets that travelled the
                              fastpath
Packets unhandled by FAST_PATH due to
  Unsupported L4 protocol     Total number of packets unhandled by the
                              fastpath due to an unsupported Layer 4
                              protocol (i.e. neither TCP nor UDP)
  No FAST_PATH established    Total number of packets unhandled by the
                              fastpath due to an un-established connection
  Wrong CONNMARK              packet CONNMARK AND /proc/net/nfp_mark == 0
                              This means this packet was not marked by
                              iptables to be delivered via fastpath.
  Other                       Total number of packets unhandled due to reasons
                              other than the above, e.g. packet fragments,
                              invalid IP checksums, or internal errors.
CONFIG_IP_NF_CT_ACCT, along with enabling other connection tracking accounting
information, enables per-connection statistics. These can be viewed by reading
/proc/net/ip_conntrack, which will show "handled" and "unhandled" packet
counts. "Handled" packets are those that were associated with the connection and
handled by the fastpath. "Unhandled" packets are those that were associated
with the connection but not handled by the fastpath, for example out-of-window
TCP packets. Note that the per-connection statistics cannot be reset by the
user.
When the fastpath code is enabled, the standard tcp and udp conntrack printing
functions are replaced with fastpath aware versions. This was done to minimize
code changes to the standard netfilter codebase. The default printing functions
are restored when the fastpath is disabled (or rmmod'd if applicable).
Enabling fastpath
--------------------
If compiled into the kernel, or when the module is loaded, fastpath is enabled
by default. Writing a 0 to /proc/net/nfp_enable will disable fastpath, and
unregister all hooks and notifiers. This allows the fastpath code to be
compiled into the kernel, and subsequently be disabled to have virtually zero
performance impact. Writing a 1 to /proc/net/nfp_enable will re-enable
fastpath and register all notifications and hooks.
Unsupported traffic
-------------------
IPv6
IPv6 traffic is not handled by this version of the fastpath.
Multicast
Multicast traffic is not handled by this version of the fastpath. Dealing with
multicast would require tracking of IGMP packets, which is typically done in
Linux by userland daemons such as mrouted. There are no known IGMP connection
tracking modules available, and writing such a module is well outside the scope
of this development.
Local traffic
Packets belonging to connections terminated in the RG are never routed through
the fastpath. The data path for such packets is very different from that used
for forwarded packets; supporting this data path would significantly complicate
the session matching design, with only a small performance benefit (due to
relatively low rates of locally terminated traffic).
Sample iptables configuration
------------------------------
# This example shows how iptables is configured to set the CONNMARK to fastpath
# all packets except those on Port 21, which are handled by the FTP ALG.
# CONNMARK masks are used to show how fastpath usage of CONNMARK can work
# together with other typical uses (inheritance of MARK by RELATED connections).
WAN=eth2
LAN=eth0
# Grab the fastpath mask value from procfs:
FMASK=`cat /proc/net/nfp_mark`
# Need an inverted version of this mask for non-fastpath usage of CONNMARK
CMASK=$(( ~FMASK & 0xFFFFFFFF ))
## PREROUTING ################################
# In PREROUTING, we use CONNMARK restore to restore any previous mark from this
# connection. This includes packets belonging to RELATED connections, such as
# the data connection in an ALG-controlled connection.
iptables -t mangle -A PREROUTING -p TCP -j CONNMARK --restore-mark --mask $CMASK
## fastpath #################################
# Create a new chain to house our fastpath rules.
iptables -t mangle -N fastpath
# TCP port 21 (ftp) is not CONNMARKed with the fastpath mask, so that the ALG will
# see all port 21 packets. The related data connection can still be fastpath'd
iptables -t mangle -A fastpath -p TCP --dport 21 -j RETURN
iptables -t mangle -A fastpath -p TCP --sport 21 -j RETURN
# set fastpath bit of all other TCP/UDP packets. Note we are using the mask
# capabilities of CONNMARK to only write the bits used for fastpath marking.
iptables -t mangle -A fastpath -p TCP -j CONNMARK --set-mark ${FMASK}/${FMASK}
iptables -t mangle -A fastpath -p UDP -j CONNMARK --set-mark ${FMASK}/${FMASK}
## FORWARD ##################################
# Run the fastpath rules
iptables -t mangle -A FORWARD -j fastpath
## POSTROUTING ######################################
# Mark ftp packets. Note mark must be purely within $CMASK bits, because
# only the CMASK bits are saved into the CONNMARK here. We don't want to
# overwrite the fastpath bits in the CONNMARK.
iptables -t mangle -A POSTROUTING -p TCP --dport 21 -j MARK --set-mark 1
# save MARK to CONNMARK for subsequent packets in this connection or
# in a RELATED connection. Note usage of mask to preserve fastpath marking.
iptables -t mangle -A POSTROUTING -p TCP -j CONNMARK --save-mark --mask $CMASK