PMC Fastpath


 Fast Path Design
-------------------

This document describes the design of the data fastpath (FP) extension for
Linux.  The fastpath ties into the connection tracking mechanism provided by
Linux Netfilter, to significantly improve network throughput performance.  The
fastpath bypasses as much of the standard Linux packet processing path as
possible, to minimize instructions per packet.  It does so by creating a "fast
path" that implements all packet mangling and routing at one point.  When a
packet is received, a check is performed very early in the datapath to see if
the packet belongs to an existing connection for which a fastpath has been
established.  If such a fastpath exists, all normal Linux processing of this
packet up to the egress queues is bypassed, with key packet mangling operations
copied from the fastpath connection information.  If this fastpath
does not exist, the packet is processed normally by Linux.

The fastpath is designed to deal with the most commonly used traffic types,
namely TCP and UDP connections.  Because these two traffic types comprise the
vast majority of traffic in a typical router, simply improving these two types
provides a significant performance improvement.  Dealing with the many other
traffic types would greatly complicate this design, with little additional
performance improvement.
                                                             
                                                           nfp_tracker
                                                               |
                                                               |
                                                               |
 RX  > netif_receive_skb > Netfilter > Route > Netfilter > Netfilter -+
driver       |             PREROUTE            FORWARD     POSTROUTE  |  
             |             mangle              mangle      mangle     +-->
             |             nat                 filter      nat             Egress
             |                                                        +-->  
             |                                                        |
             +------------------- Bypass path ------------------------+
         


Fastpath establishment (nfp_tracker)
------------------------------------
The fastpath code is an extension of the conntrack mechanism built into Linux.
A fastpath is created by first allowing normal Linux processing to occur on a
new TCP or UDP connection.  A tracker API called in the Netfilter POSTROUTING
stage creates a fastpath structure containing the necessary information to
mangle and deliver a packet.  Mangling is required to support NAT and QoS.  The
tracker API is registered to the POSTROUTING hookpoint for NF_BR_POST_ROUTING
and NF_IP_POST_ROUTING.  The hook into NF_BR_POST_ROUTING allows the fastpath
to include bypassing of the bridging code, for traffic that is routed
to a bridge.  When the tracker sees that the packet's output device is
a bridge device, it stops tracking, and waits to be called again from
the NF_BR_POST_ROUTING hook, where the output device will have been
set by the normal processing path to point to the true physical device.

The tracker (nfp_tracker()) performs a number of sanity checks on each received
packet to determine whether or not to create a fastpath for the connection.
Any of the following conditions will prevent the tracker from establishing a
fastpath:

 * Invalid CONNMARK (see CONNMARK below)
 * Locally sourced packet (see Locally sourced and terminated traffic below)
 * TCP conntrack not "ESTABLISHED and assured"
 * Outgoing interface is a bridge interface (as described above)

If none of these apply, the tracker will create a fastpath structure for this
connection, and subsequent packets belonging to this connection may be handled
entirely by the fastpath code.
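
The checks above can be summarized as a single predicate.  The following is a
minimal user-space sketch in C, not the tracker's actual code: struct
track_ctx, its fields, and nfp_may_create() are hypothetical names introduced
here for illustration, and the real nfp_tracker() works on the sk_buff and
conntrack directly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-packet facts the tracker would gather; these names
 * are illustrative and are not taken from the fastpath sources. */
struct track_ctx {
    uint32_t ct_mark;           /* CONNMARK of the packet's conntrack */
    bool     locally_sourced;   /* packet originated on this host */
    bool     is_tcp;            /* TCP (true) or UDP (false) */
    bool     tcp_established;   /* conntrack state is ESTABLISHED */
    bool     tcp_assured;       /* conntrack "assured" bit is set */
    bool     out_dev_is_bridge; /* output device is a bridge device */
};

/* Returns true when none of the four blocking conditions applies. */
bool nfp_may_create(const struct track_ctx *c, uint32_t nfp_mark)
{
    if ((c->ct_mark & nfp_mark) == 0)
        return false;           /* invalid CONNMARK */
    if (c->locally_sourced)
        return false;           /* only forwarded traffic is tracked */
    if (c->is_tcp && !(c->tcp_established && c->tcp_assured))
        return false;           /* TCP not yet ESTABLISHED and assured */
    if (c->out_dev_is_bridge)
        return false;           /* wait for the NF_BR_POST_ROUTING call */
    return true;
}
```

Keeping the decision in one side-effect-free predicate makes each blocking
condition easy to check in isolation.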

Two structs are used to store the fastpath connection information:
    struct nfp_struct
    {
        int                 iif;
        struct net_device   *output_dev;
        __be32              saddr;
        __be32              daddr;
        __u16               sport;
        __u16               dport;
        __u32               mark;
        __u32               priority;
        __u8                tos;
        struct dst_entry    *dst;
        u32                 nat;
        u32                 csum_diff[4];
    };

    struct nfp_bidir_struct
    {
        struct nfp_struct nfp[IP_CT_DIR_MAX];
    #ifdef CONFIG_IP_NF_CT_ACCT
        struct nfp_stats_struct *stats;
    #endif
    };

struct nfp_struct contains all the information for one direction of the
connection; two instances (one per direction, held in struct nfp_bidir_struct)
are used to completely describe a fastpath.  The critical information stored in
this structure is:

 * output_dev: output device
 * saddr, daddr, sport, dport: source and destination IP addresses and TCP/UDP
   ports for packet mangling.  Only TCP and UDP packets are handled.
 * nat: flag indicating whether IP address and TCP/UDP port mangling is required
 * mark, priority: sk_buff fields copied for QoS support
 * csum_diff[]: array of checksum data used to calculate a fast TCP/UDP checksum

These structs are attached to the conntrack structure via an element (misc)
added to struct ip_conntrack:

    struct ip_conntrack
    {
        /* Usage count in here is 1 for hash table/destruct timer, 1 per skb,
           plus 1 for any connection(s) we are `master' for */
        struct nf_conntrack ct_general;

        /* General-purpose pointer */
        void *misc;

        ...
    };

The allocated struct nfp_bidir_struct is assigned to misc when the fastpath is
created.


Bypass path
------------
The goal of the FP is to bypass as much of the normal processing path as
possible.  To this end, the first non-driver codepoint in the network path was
chosen as the entry point into the fastpath code.  netif_receive_skb() was
modified as follows:

        rcu_read_lock();
        netif_fn = rcu_dereference(netif_receive_fastpath);
        if (netif_fn)
        {
                if (netif_fn(skb))
                {
                        rcu_read_unlock();
                        return __netif_receive_skb(skb);
                }

                rcu_read_unlock();
                return 0;
        }

        rcu_read_unlock();
        return __netif_receive_skb(skb);

netif_receive_fastpath is a function registered using
  netdev_register_netif_receive_fastpath()
and unregistered with
  netdev_unregister_netif_receive_fastpath()

The original netif_receive_skb() function is renamed to __netif_receive_skb();
this is called if the registered fastpath function returns non-zero,
indicating it was unable to deliver the packet.

The fastpath code registers nfp_packet() as the fastpath handler function.
This function performs some basic checks to determine whether or not a packet
should be directed through the fastpath, or returned to be handled by standard
Linux processing.  Any one of the following conditions will cause a packet to be
returned to normal Linux processing:

 * Packet type is neither IP nor VLAN
 * NULL input interface (indicates locally sourced packet)
 * Shared skb
 * Invalid IPv4 checksum
 * Invalid packet length
 * IP fragmented packet
 * IP Options in packet
 * Non TCP/UDP
 * No conntrack exists yet
 * Non-matching CONNMARK
 * TCP packet contains SYN, RST, or FIN
 * TCP conntrack window check fails (tcp_in_window())
 * TTL <= 1

If none of the above conditions holds, the packet is handled by the
fastpath.  This handling involves assigning the TOS, IP source address, IP
destination address, TCP/UDP source and destination ports, and skb priority and
mark, all based on the values saved in the fastpath structure for the
connection.  The TTL of the packet is decremented, checksums are recalculated,
and ip_finish_output() is called to deliver the packet to the egress queues.
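
Decrementing the TTL does not require summing the whole IP header again.  As an
illustration only, the user-space sketch below patches the IPv4 header checksum
incrementally (in the manner of RFC 1624) and provides a full RFC 1071 checksum
for cross-checking.  The function names are hypothetical, and whether the
fastpath code updates the header checksum in exactly this way is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Full one's-complement checksum over an IPv4 header (RFC 1071).
 * len is the header length in bytes and must be even. */
uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)(hdr[i] << 8 | hdr[i + 1]);
    while (sum >> 16)                      /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Decrement the TTL (byte 8) and patch the checksum (bytes 10-11).
 * The TTL is the high byte of its 16-bit word, so the word drops by
 * 0x0100 and the stored checksum rises by 0x0100, with the carry
 * wrapped around. */
void ip_dec_ttl(uint8_t *hdr)
{
    uint32_t check = (uint32_t)(hdr[10] << 8 | hdr[11]);

    hdr[8]--;
    check += 0x0100;
    check += (check >= 0xFFFF);            /* end-around carry */
    hdr[10] = (uint8_t)(check >> 8);
    hdr[11] = (uint8_t)check;
}
```

Running ip_checksum() over a header whose checksum field is valid yields 0,
which gives a simple way to confirm the patched header.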

Note that the fastpath handler is called for *every* packet that is received by
netif_receive_skb().  This results in a small performance decrease for all
packets that are not handled by the fastpath.  This is not considered
significant, as the intent is to use the fastpath to handle the vast majority
of traffic.

VLAN traffic
________________

VLAN traffic is normally handled by Linux by calling the registered
handler for the VLAN packet type (in netif_receive_skb()).  The
default handler (vlan_skb_recv() in net/8021q/vlan_dev.c) strips the
vlan header and requeues the packet into the backlog queue of the
running CPU. This is quite inefficient, especially for QinQ (stacked
VLAN tags). This "stripping" function usually involves more than
simply a pointer manipulation by skb_pull(); for most configurations
the source and destination MAC addresses are moved in memory to
overwrite the vlan tag from the packet.  This is done to support some
layer 2 functions (bridging, DHCP, etc.), but is not at all necessary
for traffic handled by the fastpath.

The fastpath code includes its own VLAN reception code based on
vlan_skb_recv(), and calls this in a loop to remove all VLAN tags
without needing to requeue and dequeue the packet.  Some minimal error
checks are done in this code (similar to those in vlan_skb_recv()),
and any failed check causes the packet to be returned to normal Linux
processing.  If the packet is to be returned,
it is returned unmodified.  The memory move described above is also
avoided for an additional performance gain.

Once all the VLAN tags have been parsed and stripped, it is still
possible for the packet not to be handled by the fastpath because of IP
errors (see the conditions described above).  In this case, the packet is
restored to its original state (i.e. all VLAN tags and VLAN device
counters are restored), and then returned to normal Linux processing.
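
The tag-stripping loop can be pictured as a walk over the frame that only
advances an offset and never moves the MAC addresses.  The following is a
minimal user-space sketch under stated assumptions: it treats only EtherType
0x8100 as a VLAN tag, it omits the per-device lookups and counters that the
real code needs, and nfp_skip_vlan_tags() and the EX_ constants are
hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define EX_ETH_HLEN    14       /* dst MAC + src MAC + EtherType */
#define EX_VLAN_HLEN    4       /* tag control info + inner EtherType */
#define EX_ETH_P_8021Q 0x8100
#define EX_ETH_P_IP    0x0800

/* Walk all stacked VLAN tags without modifying the frame.
 * Returns the offset of the IPv4 header, or -1 when the frame should be
 * handed back, untouched, to normal processing. */
int nfp_skip_vlan_tags(const uint8_t *frame, size_t len)
{
    size_t off = EX_ETH_HLEN;
    uint16_t proto;

    if (len < EX_ETH_HLEN)
        return -1;
    proto = (uint16_t)(frame[12] << 8 | frame[13]);

    while (proto == EX_ETH_P_8021Q) {
        if (len < off + EX_VLAN_HLEN)
            return -1;                      /* truncated tag */
        /* The inner EtherType follows the 2-byte tag control info. */
        proto = (uint16_t)(frame[off + 2] << 8 | frame[off + 3]);
        off += EX_VLAN_HLEN;
    }
    return proto == EX_ETH_P_IP ? (int)off : -1;
}
```

Because the frame is only read, returning it unmodified on failure costs
nothing in this model.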


Locally sourced and terminated traffic
--------------------------------------
The fastpath is only intended to handle forwarded traffic.  All traffic to/from
the host is handled normally by Linux.  Such local traffic typically comprises
only a small amount of traffic in a router, and so little performance gain would
be realized by creating a fastpath for this traffic.

Bridged traffic
---------------
Traffic which is handled entirely by the bridge (i.e. *not* routed) will never
traverse through the fastpath (because conntrack will never see this traffic).
Throughput of bridged traffic is significantly higher than for routed traffic in
the standard Linux path, and any optimization of this path is outside the scope
of this work.


CONNMARK
-----------
The fastpath makes use of the CONNMARK capability of Netfilter to determine
whether or not packets are to be fastpath'd.  Only packets whose conntrack
CONNMARK, ANDed with the mark value configured for the fastpath (via
/proc/net/nfp_mark, default 65536), is non-zero will be considered for the
fastpath.  This mechanism allows iptables rules to easily control which
traffic should be handled via the fastpath.  The simplest case is to mark all
TCP and UDP packets with the appropriate CONNMARK so all such connections are
handled by the fastpath:

    iptables -t mangle -A FORWARD -p TCP -j CONNMARK --set-mark 65536/65536
    iptables -t mangle -A FORWARD -p UDP -j CONNMARK --set-mark 65536/65536

This simple case has a possible disadvantage in that it allows a fastpath for
unidirectional UDP traffic.  If this is not desired, iptables rules can prevent
it.  Here UDP connections are marked only once conntrack has seen traffic in
both directions (state ESTABLISHED), and unidirectional fastpaths are allowed
only on ports 1024 through 1034:

    iptables -t mangle -A FORWARD -p TCP -j CONNMARK --set-mark 65536/65536
    iptables -t mangle -A FORWARD -p UDP -m state --state ESTABLISHED -j CONNMARK --set-mark 65536/65536
    iptables -t mangle -A FORWARD -p UDP --dport 1024:1034 -j CONNMARK --set-mark 65536/65536

CONNMARK is frequently used in firewalls to allow RELATED connections to inherit
a MARK value from a connection.  For example, a data connection in an ALG such
as FTP can inherit the MARK from the primary connection, and this mark can be
used for QoS control.  For ALGs whose related ports are arbitrary, there is no
other convenient way to mark such related streams.  By using a bitmask for the
nfp_mark value in the fastpath code, the traditional use of CONNMARK can be
preserved.  For example, with the default value of 65536, only bit 16 of
CONNMARK is needed for fastpath, so bits 0-15 and 17-31 can be used for other
purposes.  All known uses of CONNMARK require only a few unique values to be
stored.  This scheme requires that the CONNMARK rules in iptables make use of
the mask capabilities in this rule.  The example iptables configuration at the
end of this document shows how the CONNMARK is shared for fastpath and other
uses.
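
The bit sharing described above comes down to two mask operations.  The sketch
below is a simplified model, not the Netfilter implementation: connmark_set()
models the effect of "CONNMARK --set-mark value/mask" as a masked write, and
nfp_mark_matches() models the fastpath test against nfp_mark.  Both function
names are hypothetical, and the kernel's CONNMARK target may combine the bits
slightly differently.

```c
#include <stdbool.h>
#include <stdint.h>

#define NFP_MARK_DEFAULT 0x10000u   /* 65536: bit 16 only */

/* Model of "CONNMARK --set-mark value/mask": only masked bits change. */
uint32_t connmark_set(uint32_t ctmark, uint32_t value, uint32_t mask)
{
    return (ctmark & ~mask) | (value & mask);
}

/* A connection is a fastpath candidate when any nfp_mark bit is set. */
bool nfp_mark_matches(uint32_t ctmark, uint32_t nfp_mark)
{
    return (ctmark & nfp_mark) != 0;
}
```

For example, a connection whose CONNMARK already holds the QoS value 1 keeps
that value in its low bits after the fastpath bit is written with
connmark_set(mark, NFP_MARK_DEFAULT, NFP_MARK_DEFAULT).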


ALG considerations
------------------
To allow ALGs to function correctly, it is usually necessary to prevent traffic
that needs to be seen by an ALG from being handled by the fastpath.  This
requires additional iptables rules.  For example, to prevent packets destined to
port 21 (FTP) from being handled via the fastpath, the iptables rules should
change to:

    iptables -t mangle -A FORWARD -p TCP --dport ! 21 -j CONNMARK --set-mark 65536/65536
    iptables -t mangle -A FORWARD -p UDP -j CONNMARK --set-mark 65536/65536

Because port 21 destined packets are not fastpath'd, they will always be
handled by the normal Linux path, and the FTP ALG will thus operate correctly.
Note it is *not* necessary to prevent fastpath of the connection that is RELATED
to the FTP connection.  This is so because the ALG helper function for FTP does
not run on packets in the RELATED connection, but only on those in the primary
connection.

ALGs which need to see *all* traffic, including any RELATED streams, will need
further iptables rules to prevent fastpath handling of such traffic.



Network event handling
------------------------

Numerous network events occur that cause changes in packet forwarding.  Because
the fastpath bypasses much of the networking code, the normal mechanisms by
which the effects of these events propagate are not effective.  For example,
consider routing table changes that occur while fastpaths are already
established.  Because the fastpath stores the output interface and does not do
a routing lookup for each packet, the routing change will not affect packets
handled by the fastpath, and they would thus be delivered to the wrong
interface.

The solution to this is to register notifiers to existing notification chains,
and create new notification chains where needed.  The following network events
are considered:
  * Interface down : A callback is registered via register_netdevice_notifier()
    to handle interface DOWN events. The handler removes all fastpath
    structures which include the device.

  * IP Address change:  register_inetaddr_notifier() is called to register a
    handler that destroys all fastpath structures that include the device
    whose address has changed.
 
  * ip_conntrack notifications: A handler is registered with the existing
    ip_conntrack_chain to handle changes to conntracks (expiring connections,
    interface down, etc.).  The handler only handles the conntrack IPCT_DESTROY
    event, and destroys the fastpath structures associated with that conntrack.
    (Note: as of kernel version 2.6.20, the ip_conntrack notification is still
    deemed experimental.)

  * Routing changes: A new notification chain is added to net/ipv4/route.c.
    This chain is called whenever the routing table is flushed.  This provides
    notification for all routing changes.  The fastpath handler destroys all
    fastpaths in this case, because it cannot know whether the routes inherent
    in the fastpath structures are still valid.  This does *not* cause packet
    loss, because the packets will simply traverse the normal linux processing
    path (causing the fastpath for that connection to be re-established).

  * Netfilter/Iptables change events: Because the fastpath bypasses all
    Netfilter processing, changes to netfilter via iptables commands will not
    have any effect on established connections.  This is obviously highly
    undesirable, as a user could not introduce new iptables rules to block a
    current DoS attack, for example.  A notification chain has been added to
    net/ipv4/netfilter/ip_tables.c, and is traversed whenever __do_replace() is
    called to install a new iptables ruleset.  The fastpath handler for this
    notification destroys all fastpath structures, identical to the routing
    handler described above.

  * Bridge configuration changes: A new notification chain is added to
    net/bridge/br_ioctl.c, which is traversed whenever an interface is added to
    or removed from a bridge.  The fastpath handler in this chain destroys all
    fastpath structures which include the added/removed device.
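
The handlers above follow two patterns: destroy the fastpaths that reference
one device, or destroy all of them.  The following user-space sketch models
both over a flat table.  struct nfp_entry, nfp_flush_dev(), and
nfp_flush_all() are hypothetical names, and the real code would walk
conntrack entries under the appropriate locking rather than an array.

```c
#include <stddef.h>

/* Hypothetical, simplified fastpath record: device references only. */
struct nfp_entry {
    int in_use;   /* non-zero while the fastpath exists */
    int iif;      /* input interface index */
    int oif;      /* output interface index */
};

/* Interface down, address change, or bridge membership change:
 * destroy only the fastpaths that reference the affected device.
 * Returns the number of fastpaths destroyed. */
size_t nfp_flush_dev(struct nfp_entry *tbl, size_t n, int ifindex)
{
    size_t removed = 0;

    for (size_t i = 0; i < n; i++) {
        if (tbl[i].in_use &&
            (tbl[i].iif == ifindex || tbl[i].oif == ifindex)) {
            tbl[i].in_use = 0;
            removed++;
        }
    }
    return removed;
}

/* Routing flush or iptables ruleset replacement: destroy everything,
 * since the handler cannot tell which entries are still valid. */
size_t nfp_flush_all(struct nfp_entry *tbl, size_t n)
{
    size_t removed = 0;

    for (size_t i = 0; i < n; i++) {
        if (tbl[i].in_use) {
            tbl[i].in_use = 0;
            removed++;
        }
    }
    return removed;
}
```

In both cases later packets of an affected connection simply take the normal
Linux path until the tracker re-establishes the fastpath.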
 
 

IPTables Integration: Targets
-----------------------------

IPTables provides multiple hook points and a wide (and constantly growing) array
of filtering/matching functions, some of which integrate easily with a
connection-based data fastpath, and some of which are incompatible with it.

The following description of the IPTables targets and match criteria pertains
only to traffic that is NOT handled by ALGs.  ALG-handled traffic is discussed
elsewhere in this document.

  1. Pre-ESTABLISHED targets

The following targets are supported and function normally only until a TCP/UDP
connection is ESTABLISHED and assured: LOG, ULOG, ACCEPT, REJECT, DROP,
NOTRACK, RETURN, REDIRECT, NFQUEUE, QUEUE.  Once a connection is ESTABLISHED
and assured, the data fastpath will be used, so these targets will no longer
be reached.

Note that some of these targets (NFQUEUE, QUEUE, REDIRECT, REJECT, DROP,
NOTRACK) will prevent establishment of a conntrack connection, and so will work
normally as on unmodified Linux.

  2. Packet / sk_buff  modifier targets

Includes these targets: CLASSIFY, MARK, TOS, DSCP, SAME, SNAT, DNAT,
MASQUERADE, NETMAP.  These targets are supported by the fastpath code on a
per-connection basis.  This means that during connection establishment, the
effect of these modifier targets is recorded by the fastpath tracker, and these
effects are replicated on all subsequent packets that are handled by the
fastpath.  (The TTL target is not among them; see the summary table below.)

  3. Unsupported targets

If packet modification based on non-connection characteristics of packets is
required (such as packet length, for example), the data fastpath must not be
used for that connection.

  4.  Summary of IPTables targets

Netfilter Target    Modifies                        Behaviour with
                                                    fastpath
---------------     --------------                  -------------------
CLASSIFY            skb->priority                   Supported
CONNMARK            ct->mark                        Unsupported
MARK                skb->nfmark                     Supported
NFQUEUE (old)       queue packets to userland       Supported pre-ESTABLISHED
NOTRACK             prevents conntrack on packets   Supported pre-ESTABLISHED
SECMARK             skb->secmark                    Unsupported
CONNSECMARK         ct->secmark                     Unsupported
REJECT              Drop packet                     Supported pre-ESTABLISHED
REDIRECT            dest IP addr to local addr      Supported pre-ESTABLISHED
RETURN              chain traversal                 Supported pre-ESTABLISHED
SAME                Source NAT                      Supported
SNAT                Source NAT                      Supported
TCPMSS              TCP MSS alteration              Supported (see Note 1)
TOS                 set TOS in IP header            Supported
TTL                 set TTL in IP header            Unsupported
ULOG                logging                         Supported pre-ESTABLISHED
ACCEPT                                              Supported pre-ESTABLISHED
CLUSTERIP           For server clusters             Unsupported
DNAT                Destination NAT                 Supported
DROP                                                Supported pre-ESTABLISHED
DSCP                Set DSCP in IP header           Supported
ECN                                                 TBD
LOG                 logging                         Supported pre-ESTABLISHED
MASQUERADE          Source NAT                      Supported
NETMAP              Source/Dest NAT                 Supported

Note 1: TCPMSS should only be used on SYN packets, which are never handled by
the fastpath, so it is not affected by the data fastpath.


IPTables Integration : Matches
-------------------------------

All IPTables matches can be used during the establishment of a connection.
There are two types of matches relevant to fastpath, connection specific and
packet specific.   Connection specific matches are those matches that are
invariant across all packets belonging to a specific conntrack connection, such
as:

 * IP source/destination address
 * TCP/UDP port
 * IP Protocol (TCP/UDP)

Iptables rules that match on these parameters are not traversed by packets
belonging to an established fastpath connection, but because these parameters
*define* a connection, any change to them will result in a new connection, and
hence such changed packets will traverse the ruleset.

All other match criteria can only be expected to match during the
pre-ESTABLISHED phase of a connection.  Once the connection is established,
subsequent packets belonging to that connection will not traverse the IPTables
rules.  For virtually all traffic, this behaviour is quite acceptable, as many
of the commonly used match criteria will match on fields that do not normally
change.  For example, a match on a DSCP value will only be used during
connection establishment.  Although it is theoretically possible for subsequent
packets belonging to that connection to have a different DSCP value from that
used during pre-establishment, in practice this should not occur.

Examples of matches that do not work cleanly with the fastpath are:

 * length: A rule could be devised to drop all VOIP packets of unexpected
   length.  Such a rule would not see packets handled by the fastpath.

 * limit/hashlimit: limit and hashlimit rules match traffic rates, so to
   work correctly, they would need to see all applicable packets.

 * mac match: Because conntrack does not monitor Layer 2 information, it is
   possible to have packets come in with different MAC addresses yet match the
   same conntrack.  This is likely a non-issue, as standard Linux firewalls
   based on conntrack would also suffer from the same problem.

An exhaustive analysis of every possible netfilter matching criterion is beyond
the scope of this document.  Rather, a simple guideline holds:

  Traffic that is intended to be handled by the fastpath should only be
  subjected to Netfilter matches that apply to all packets in a conntrack
  connection.

The solution to cases where unsupported Netfilter matches are required is to not
allow a fastpath to be established for that conntrack connection, by not
setting the fastpath CONNMARK bit for that connection.  For example, assuming
the default nfp_mark of 65536, the following rules would prevent traffic to the
default bittorrent ports from traversing the fastpath, and rate-limit matching
bittorrent packets to 1000 pkts/sec:

    iptables -t mangle -A FORWARD -j CONNMARK --set-mark 65536/65536
    iptables -t mangle -A FORWARD -p TCP --dport 6881:6889 -j CONNMARK --set-mark 0/65536
    iptables -A FORWARD -p TCP --dport 6881:6889 -m limit --limit 1000/second \
                   --limit-burst 1000 -j ACCEPT


Fastpath monitoring and statistics
-----------------------------------

When fastpath is enabled, the conntrack table available from
/proc/net/ip_conntrack is extended to include a flag indicating whether or not a
connection is handled by the fastpath.  For example, here are two conntrack
entries, the first handled by normal processing, the second by the fastpath:
 udp      17 27 src=192.168.0.3 dst=192.168.2.2 sport=1025 dport=1025 ...
 udp      17 177 [FAST_PATH] src=192.168.2.2 dst=192.168.0.3 sport=1025 ...

Two Linux kernel compilation flags control enabling of fastpath statistics.
CONFIG_PMC_SM_STATS enables overall fastpath statistics, which can be viewed by
reading /proc/net/pmc_sm_stats. Writing to this /proc entry will clear these
statistics. The statistics included are:
   Counter                     Description
   ----------                  ------------
   Active TCP FAST_PATH        Current number of TCP connections that will be
   connections                 handled by the fastpath

   Active UDP FAST_PATH        Current number of UDP "connections" that will be
   connections                 handled by the fastpath

   FAST_PATH Packets handled   Total number of packets that travelled the
                                fastpath

   Packets unhandled by FAST_PATH due to

     Unsupported L4 protocol      Total number of packets unhandled by the
                                  fastpath due to an unsupported Layer 4
                                  protocol (i.e. neither TCP nor UDP)

     No FAST_PATH established     Total number of packets unhandled by the
                                  fastpath due to an un-established connection

     Wrong CONNMARK               Total number of packets unhandled because
                                  (CONNMARK & /proc/net/nfp_mark) == 0, i.e.
                                  the packet was not marked by iptables to be
                                  delivered via the fastpath

     Other                        Total number of packets unhandled due to reasons
                                  other than the above, e.g. packet fragments,
                                  invalid IP checksums, or internal errors.


CONFIG_IP_NF_CT_ACCT, along with enabling other connection tracking accounting
information, enables per-connection statistics. These can be viewed by reading
/proc/net/ip_conntrack, which will show "handled" and "unhandled" packet
counts. "Handled" packets are those that were associated with the connection and
handled by the fastpath. "Unhandled" packets are those that were associated
with the connection but not handled by the fastpath, for example out-of-window
TCP packets. Note that the per-connection statistics cannot be reset by the
user.

When the fastpath code is enabled, the standard tcp and udp conntrack printing
functions are replaced with fastpath aware versions.  This was done to minimize
code changes to the standard netfilter codebase.  The default printing functions
are restored when the fastpath is disabled (or rmmod'd if applicable).


Enabling fastpath
--------------------
If compiled into the kernel, or when the module is loaded, fastpath is enabled
by default.  Writing a 0 to /proc/net/nfp_enable will disable fastpath, and
unregister all hooks and notifiers.  This allows the fastpath code to be
compiled into the kernel, and subsequently be disabled to have virtually zero
performance impact.  Writing a 1 to /proc/net/nfp_enable will re-enable
fastpath and register all notifications and hooks.


Unsupported traffic
-------------------
   IPv6
IPv6 traffic is not handled by this version of the fastpath.

   Multicast
Multicast traffic is not handled by this version of the fastpath.  Dealing with
multicast would require tracking of IGMP packets, which is typically done in
Linux by userland daemons such as mrouted.  There are no known IGMP connection
tracking modules available, and writing such a module is well outside the scope
of this development.

   Local traffic
Packets belonging to connections terminated in the RG are never routed through
the fastpath.  The data path for such packets is very different from that used
for forwarded packets; supporting this data path would significantly complicate
the session matching design, with only a small performance benefit (due to
relatively low rates of locally terminated traffic).


Sample iptables configuration
------------------------------
# This example shows how iptables is configured to set the CONNMARK to fastpath
# all packets except those on Port 21, which are handled by the FTP ALG.
# CONNMARK masks are used to show how fastpath usage of CONNMARK can work
# together with other typical uses (inheritance of MARK by RELATED connections).
WAN=eth2                                                                      
LAN=eth0                                                                      

# Grab the fastpath mask value from procfs:
FMASK=`cat /proc/net/nfp_mark`
# Need an inverted version of this mask for non-fastpath usage of CONNMARK.
# The result is limited to 32 bits, the width of the mark.
CMASK=$(( ~FMASK & 0xFFFFFFFF ))

## PREROUTING ################################
# In PREROUTING, we use CONNMARK restore to restore any previous mark from this
# connection. This includes packets belonging to RELATED connections, such as
# the data connection in an ALG-controlled connection.

iptables  -t mangle -A PREROUTING -p TCP -j CONNMARK --restore-mark --mask $CMASK


## fastpath #################################                                  
# Create a new chain to house our fastpath rules.
iptables -t mangle -N fastpath                                                
# TCP port 21 (ftp) is not CONNMARKed with the fastpath mask, so that the ALG will
# see all port 21 packets.  The related data connection can still be fastpath'd
iptables -t mangle -A fastpath -p TCP --dport  21 -j RETURN                    
iptables -t mangle -A fastpath -p TCP --sport  21 -j RETURN                    
                                                                             
# set fastpath bit of all other TCP/UDP packets.  Note we are using the mask
# capabilities of CONNMARK to only write the bits used for fastpath marking.
iptables -t mangle -A fastpath -p TCP -j CONNMARK --set-mark ${FMASK}/${FMASK}
iptables -t mangle -A fastpath -p UDP -j CONNMARK --set-mark ${FMASK}/${FMASK}



## FORWARD ##################################
# Run the fastpath rules
iptables -t mangle -A FORWARD -j fastpath
                                                                               

## POSTROUTING ######################################
# Mark ftp packets.  Note mark must be purely within $CMASK bits, because
# only the CMASK bits are saved into the CONNMARK here.  We don't want to
# overwrite the fastpath bits in the CONNMARK.

iptables -t mangle -A POSTROUTING -p TCP --dport 21 -j MARK --set-mark 1      

# save MARK to CONNMARK for subsequent packets in this connection or          
# in a RELATED connection. Note usage of mask to preserve fastpath marking.  
iptables -t mangle -A POSTROUTING -p TCP -j CONNMARK --save-mark --mask $CMASK
                                                                               

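netif_rx analysis

After a network device receives a datagram from another host on the network, or
from the local loopback interface, it hands the datagram to the protocol
stack's netif_rx function.  This function first stamps the received skb with
the current time (the skb->tstamp member), which records when the data
arrived.  The timestamp is optional and can be turned on with the socket option
SO_TIMESTAMP.  When that option enables timestamping, the link-layer global
variable netstamp_needed is incremented, and netif_rx timestamps the skb
whenever it finds this variable to be non-zero.

softnet_data is a global variable of type struct softnet_data, defined once per
CPU.  It is the link layer's receive queue.  The structure is defined as
follows: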
    struct softnet_data
    {
        struct net_device   *output_queue;
        struct sk_buff_head input_pkt_queue;
        struct list_head    poll_list;
        struct sk_buff      *completion_queue;
        struct net_device   backlog_dev;
    };
    input_pkt_queue是skb的队列,接收到的skb全部进入该队列等待后续处理,netif_rx首先检查该队列当前的长度input_pkt_queue.qlen,即当前排在队列中的skb的数量,当数量超过netdev_max_backlog的值时,直接丢弃新收到的包,netdev_max_backlog在协议栈中定义的缺省值为1000,可以通过文件/proc/sys/net/core/netdev_max_backlog进行修改。如果当前队列长度未达到上限,把新收到的skb加到这个队列中,在加到队列之前,要确保对这个队列的接收处理已启动,如果当前队列为空,则要先调用netif_rx_schedule启动队列的处理,再把skb加到队列中。需要注意的是softnet_data是CPU绑定的,但不是网络设备绑定的,多个网络设备收到的数据报可能存放在同一个队列中待处理。
    netif_rx_schedule函数的主要作用是触发一个软中断NET_RX_SOFTIRQ,使中断处理函数net_rx_action处理接收队列中的数据报。net_rx_action开始时会记录下系统的当前时间,然后进行处理,当处理时间持续超过1个时钟嘀嗒时,它会再触发一个中断NET_RX_SOFTIRQ,并退出,在下一个中断中继续处理。一次中断处理除了时间上有限制,处理的数据报的数量上也有限制。
    softnet_data的成员poll_list中存放的是成员backlog_dev的地址,由netif_rx_schedule存入,backlog_dev的成员poll在系统初始化时被指向函数process_backlog,net_rx_action调用该函数进行实际的数据报处理,process_backlog把数据报从input_pkt_queue队列中取出,传给netif_receive_skb,由netif_receive_skb传给相应的网络层接收函数。process_backlog的处理时间也有1个时钟嘀嗒的限制,同时一次处理的数据报的数量不得超过backlog_dev->quota和netdev_budget两个值中较小的那个值,backlog_dev->quota由netif_rx_schedule初始化为全局变量weight_p的值,缺省为64,netdev_budget缺省为300。从代码可以看出,process_backlog一次处理最大数据报数量为64,而net_rx_action为300。weight_p和netdev_budget这两个值分别可以在文件/proc/sys/net/core/dev_weight和/proc/sys/net/core/netdev_budget中查看和修改。
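    netif_receive_skb is the last stop for a received datagram in the link layer. Based on the network-layer packet types registered in the global ptype_all and ptype_base tables, it delivers the datagram to the receive function of the matching network-layer protocol (in the INET domain, mainly ip_rcv and arp_rcv).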

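source and export

Usage of the source command:
source FileName
Purpose: read and execute the commands in FileName in the current bash environment.
Note: the command is often written as "." instead.
For example, source .bashrc and . .bashrc are equivalent.
Note: the difference between source and running a shell script is that source executes the commands in the current bash environment, whereas running a script starts a child shell to execute them. If commands that set environment variables (or aliases, and so on) are put in a script, they only affect the child shell and cannot change the current bash. So when environment variables are set from a file (a list of commands), use the source command.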
 
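After modifying /etc/profile I wanted the change to take effect immediately, without logging in again, and that is where the source command comes in, for example: source /etc/profile
I studied source and compared it with running a script through sh; here is a summary.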
 
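The source command:
The source command is also known as the "dot command", that is, a single dot (.), and it is a bash builtin.
Function: makes the shell read the specified shell program file and execute all of its statements in order.
source is typically used to re-execute an initialization file that has just been modified, so that it takes effect immediately without logging out and back in.
Usage:
source filename or . filename
The source command (which comes from the C shell) is a bash builtin; the dot command (.), which comes from the Bourne shell, is another name for source.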
 
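What is the difference between running a script with source filename, sh filename, and ./filename?
1. When the script has execute permission, sh filename and ./filename behave the same as far as variables are concerned: both run the script in a child process. (sh filename always uses sh, while ./filename uses the interpreter named on the script's #! line.) The ./ prefix is needed because the current directory is not in PATH, so "." is used to denote the current directory.
2. sh filename starts a new child shell and executes the script's statements in it. The child shell inherits the parent shell's environment variables, but variables created or changed in the child are not carried back to the parent shell. export does not change this; it only passes variables down to children.
3. source filename simply reads the statements in the script and executes them one by one in the current shell; no child shell is created. All statements in the script that create or change variables therefore take effect in the current shell.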
 
 
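Example:
1. Create a script test.sh with the content: A=1
2. Make it executable: chmod +x test.sh
3. Run sh test.sh, then echo $A: the output is empty, because A=1 was not passed back to the current shell.
4. Running ./test.sh has the same effect.
5. Run source test.sh or . test.sh, then echo $A: it prints 1, showing that the variable A=1 exists in the current shell.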
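The steps above can be replayed as one script. This is a minimal sketch under a few assumptions: it works in a temporary directory created with mktemp, and it uses the portable "." form instead of source so that it also runs under a plain POSIX sh. The variable names after_sh and after_dot exist only for this example.

```shell
# Work in a throwaway directory so nothing in the current directory is touched.
workdir=$(mktemp -d)
printf 'A=1\n' > "$workdir/test.sh"
chmod +x "$workdir/test.sh"

unset A
sh "$workdir/test.sh"      # runs in a child shell; A is set there and then lost
after_sh=${A:-}            # still empty in this shell

. "$workdir/test.sh"       # runs in this shell; A is set here
after_dot=${A:-}

echo "after sh: '${after_sh}'"
echo "after . : '${after_dot}'"

rm -rf "$workdir"
```

The first echo shows an empty value and the second shows 1, matching steps 3 and 5.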
 
 
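export turns a user-defined shell variable into an environment variable.
The conclusions are:
1. A script is executed in a child shell, which exits automatically when the script finishes.
2. Only a shell's environment variables (those defined with export) are copied into its child shells.
3. A shell's environment variables are valid only in that shell and its child shells; they disappear when the shell ends and cannot be passed back to the parent shell.
4. Variables defined without export are valid only in that shell and are not visible in its child shells.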
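A short sketch of these conclusions follows. The variable names PLAIN and EXPORTED are made up for the example; sh -c is used to start a child shell.

```shell
unset PLAIN EXPORTED
PLAIN=local_only               # ordinary shell variable, not exported
export EXPORTED=inherited      # environment variable, copied to children

# Single quotes so that the child shell, not this one, expands the variables.
child_view=$(sh -c 'echo "PLAIN=[${PLAIN:-}] EXPORTED=[${EXPORTED:-}]"')
echo "$child_view"

# A change made inside the child never comes back to the parent,
# whether the variable is exported or not.
sh -c 'EXPORTED=changed_in_child'
echo "parent still sees EXPORTED=$EXPORTED"
```

The child sees an empty PLAIN and the inherited EXPORTED, and the parent's EXPORTED is unchanged after the child exits.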

TSO GSO

Large segment offload

From Wikipedia, the free encyclopedia

[Figure: dialog box showing TCP segmentation offload settings for an Intel Pro 1000 NIC]
In computer networking, large segment offload (LSO) is a technique for increasing outbound throughput of high-bandwidth network connections by reducing CPU overhead. It works by queuing up large buffers and letting the network interface card (NIC) split them into separate packets. The technique is also called TCP segmentation offload (TSO) when applied to TCP, or generic segmentation offload (GSO).
The inbound counterpart of large segment offload is large receive offload (LRO).

Operation

When large chunks of data are to be sent over a computer network, they need to be first broken down to smaller segments that can pass through all the network elements like routers and switches between the source and destination computers. This process is referred to as segmentation. Segmentation is often done by the TCP protocol in the host computer. Offloading this work to the NIC is called TCP segmentation offload (TSO).
For example, a unit of 64 KB (65,536 bytes) of data is usually split into 46 segments of at most 1448 bytes each before it is sent over the network through the NIC. With some intelligence in the NIC, the host CPU can hand over the 64 KB of data to the NIC in a single transmit request, the NIC can break that data down into smaller segments of 1448 bytes, add the TCP, IP, and data link layer protocol headers (according to a template provided by the host's TCP/IP stack) to each segment, and send the resulting frames over the network. This significantly reduces the work done by the CPU. Many new NICs on the market today support TSO.
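The segment count in this example is plain integer arithmetic, sketched below with the two numbers from the text. The variable names are chosen for the example only.

```shell
payload=65536   # 64 KB handed to the NIC in one transmit request
mss=1448        # payload bytes carried by each full segment, as in the text

segments=$(( (payload + mss - 1) / mss ))   # round up to whole segments
full=$(( payload / mss ))                   # segments filled completely
last=$(( payload - full * mss ))            # bytes left for the final segment

echo "$segments segments: $full of $mss bytes, last one $last bytes"
```

This prints 46 segments: 45 of 1448 bytes and a final one of 376 bytes. Without TSO the host builds headers for all 46; with TSO it issues a single request.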
Some network cards implement TSO generically enough that it can be used for offloading fragmentation of other transport layer protocols, or for doing IP fragmentation for protocols that don't support fragmentation by themselves, such as UDP.

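Linux Protocol Stack in Detail: Protocol Handling

Protocol handling: this part describes how data obtained from the driver is dispatched to the different protocol handlers, including the IP and ARP protocols.

__netif_receive_skb is where protocol handling begins. The main data structures are ptype_all and ptype_base: ptype_all is a linked list, while ptype_base is an array indexed by a hash of the type.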

    list_for_each_entry_rcu(ptype, &ptype_all, list) {
        if (!ptype->dev || ptype->dev == skb->dev) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
ptype_all is typically used for things like packet capture, that is, its handlers receive every packet regardless of type.


    type = skb->protocol;
    list_for_each_entry_rcu(ptype,
            &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
        if (ptype->type == type &&
            (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
             ptype->dev == orig_dev)) {
            if (pt_prev)
                ret = deliver_skb(skb, pt_prev, orig_dev);
            pt_prev = ptype;
        }
    }
ptype_base, by contrast, looks up the handler functions by the specific type.
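The bucket index used in the loop above, ntohs(type) & PTYPE_HASH_MASK, can be worked out by hand. The sketch below assumes PTYPE_HASH_SIZE is 16, so the mask is 15; that value is not stated in these notes and should be checked against the kernel tree in use. The helper name bucket is hypothetical.

```shell
# Assumption: PTYPE_HASH_SIZE is 16, so PTYPE_HASH_MASK = 16 - 1.
PTYPE_HASH_MASK=$(( 16 - 1 ))

# bucket is a hypothetical helper for this sketch.
# $1 is the EtherType in host byte order, i.e. the value of ntohs(type).
bucket() {
    echo $(( $1 & PTYPE_HASH_MASK ))
}

for name_value in ETH_P_IP:0x0800 ETH_P_ARP:0x0806 ETH_P_IPV6:0x86DD; do
    name=${name_value%%:*}
    value=${name_value##*:}
    echo "$name ($value) -> ptype_base[$(bucket "$value")]"
done
```

Under that assumption IP lands in bucket 0, ARP in bucket 6, and IPv6 in bucket 13, so each lookup only walks the handlers that share its bucket.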

This raises a few questions.

1. When are the handlers registered?

Registration is done through dev_add_pack.

IP protocol

    static struct packet_type ip_packet_type __read_mostly = {
        .type = cpu_to_be16(ETH_P_IP),
        .func = ip_rcv,
        .gso_send_check = inet_gso_send_check,
        .gso_segment = inet_gso_segment,
        .gro_receive = inet_gro_receive,
        .gro_complete = inet_gro_complete,
    };

It is registered by a call to dev_add_pack in inet_init.

ARP protocol

    static struct packet_type arp_packet_type __read_mostly = {
        .type = cpu_to_be16(ETH_P_ARP),
        .func = arp_rcv,
    };

All protocol IDs that can be handled

    /*
     *  These are the defined Ethernet Protocol ID's.
     */

    #define ETH_P_LOOP       0x0060      /* Ethernet Loopback packet */
    #define ETH_P_PUP        0x0200      /* Xerox PUP packet */
    #define ETH_P_PUPAT      0x0201      /* Xerox PUP Addr Trans packet */
    #define ETH_P_IP         0x0800      /* Internet Protocol packet */
    #define ETH_P_X25        0x0805      /* CCITT X.25 */
    #define ETH_P_ARP        0x0806      /* Address Resolution packet */
    #define ETH_P_BPQ        0x08FF      /* G8BPQ AX.25 Ethernet Packet  [ NOT AN OFFICIALLY REGISTERED ID ] */
    #define ETH_P_IEEEPUP    0x0a00      /* Xerox IEEE802.3 PUP packet */
    #define ETH_P_IEEEPUPAT  0x0a01      /* Xerox IEEE802.3 PUP Addr Trans packet */
    #define ETH_P_DEC        0x6000      /* DEC Assigned proto */
    #define ETH_P_DNA_DL     0x6001      /* DEC DNA Dump/Load */
    #define ETH_P_DNA_RC     0x6002      /* DEC DNA Remote Console */
    #define ETH_P_DNA_RT     0x6003      /* DEC DNA Routing */
    #define ETH_P_LAT        0x6004      /* DEC LAT */
    #define ETH_P_DIAG       0x6005      /* DEC Diagnostics */
    #define ETH_P_CUST       0x6006      /* DEC Customer use */
    #define ETH_P_SCA        0x6007      /* DEC Systems Comms Arch */
    #define ETH_P_TEB        0x6558      /* Trans Ether Bridging */
    #define ETH_P_RARP       0x8035      /* Reverse Addr Res packet */
    #define ETH_P_ATALK      0x809B      /* Appletalk DDP */
    #define ETH_P_AARP       0x80F3      /* Appletalk AARP */
    #define ETH_P_8021Q      0x8100      /* 802.1Q VLAN Extended Header */
    #define ETH_P_IPX        0x8137      /* IPX over DIX */
    #define ETH_P_IPV6       0x86DD      /* IPv6 over bluebook */
    #define ETH_P_PAUSE      0x8808      /* IEEE Pause frames. See 802.3 31B */
    #define ETH_P_SLOW       0x8809      /* Slow Protocol. See 802.3ad 43B */
    #define ETH_P_WCCP       0x883E      /* Web-cache coordination protocol
                                          * defined in draft-wilson-wrec-wccp-v2-00.txt */
    #define ETH_P_PPP_DISC   0x8863      /* PPPoE discovery messages */
    #define ETH_P_PPP_SES    0x8864      /* PPPoE session messages */
    #define ETH_P_MPLS_UC    0x8847      /* MPLS Unicast traffic */
    #define ETH_P_MPLS_MC    0x8848      /* MPLS Multicast traffic */
    #define ETH_P_ATMMPOA    0x884c      /* MultiProtocol Over ATM */
    #define ETH_P_LINK_CTL   0x886c      /* HPNA, wlan link local tunnel */
    #define ETH_P_ATMFATE    0x8884      /* Frame-based ATM Transport
                                          * over Ethernet
                                          */
    #define ETH_P_PAE        0x888E      /* Port Access Entity (IEEE 802.1X) */
    #define ETH_P_AOE        0x88A2      /* ATA over Ethernet */
    #define ETH_P_TIPC       0x88CA      /* TIPC */
    #define ETH_P_1588       0x88F7      /* IEEE 1588 Timesync */
    #define ETH_P_FCOE       0x8906      /* Fibre Channel over Ethernet */
    #define ETH_P_FIP        0x8914      /* FCoE Initialization Protocol */
    #define ETH_P_EDSA       0xDADA      /* Ethertype DSA [ NOT AN OFFICIALLY REGISTERED ID ] */

    /*
     *  Non DIX types. Won't clash for 1500 types.
     */

    #define ETH_P_802_3      0x0001      /* Dummy type for 802.3 frames */
    #define ETH_P_AX25       0x0002      /* Dummy protocol id for AX.25 */
    #define ETH_P_ALL        0x0003      /* Every packet (be careful!!!) */
    #define ETH_P_802_2      0x0004      /* 802.2 frames */
    #define ETH_P_SNAP       0x0005      /* Internal only */
    #define ETH_P_DDCMP      0x0006      /* DEC DDCMP: Internal only */
    #define ETH_P_WAN_PPP    0x0007      /* Dummy type for WAN PPP frames */
    #define ETH_P_PPP_MP     0x0008      /* Dummy type for PPP MP frames */
    #define ETH_P_LOCALTALK  0x0009      /* Localtalk pseudo type */
    #define ETH_P_CAN        0x000C      /* Controller Area Network */
    #define ETH_P_PPPTALK    0x0010      /* Dummy type for Atalk over PPP */
    #define ETH_P_TR_802_2   0x0011      /* 802.2 frames */
    #define ETH_P_MOBITEX    0x0015      /* Mobitex (kaz@cafe.net) */
    #define ETH_P_CONTROL    0x0016      /* Card specific control frames */
    #define ETH_P_IRDA       0x0017      /* Linux-IrDA */
    #define ETH_P_ECONET     0x0018      /* Acorn Econet */
    #define ETH_P_HDLC       0x0019      /* HDLC frames */
    #define ETH_P_ARCNET     0x001A      /* 1A for ArcNet :-) */
    #define ETH_P_DSA        0x001B      /* Distributed Switch Arch. */
    #define ETH_P_TRAILER    0x001C      /* Trailer switch tagging */
    #define ETH_P_PHONET     0x00F5      /* Nokia Phonet frames */
    #define ETH_P_IEEE802154 0x00F6      /* IEEE802.15.4 frame */
    #define ETH_P_CAIF       0x00F7      /* ST-Ericsson CAIF protocol */

2. When is skb->protocol assigned?

The driver is responsible for setting protocol, and it does so through eth_type_trans; skb->pkt_type is also set in that function.

    /* Packet types */

    #define PACKET_HOST         0       /* To us */
    #define PACKET_BROADCAST    1       /* To all */
    #define PACKET_MULTICAST    2       /* To group */
    #define PACKET_OTHERHOST    3       /* To someone else */
    #define PACKET_OUTGOING     4       /* Outgoing of any type */

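Now that we know how protocol dispatch works, the natural next step is to study the individual protocols, for example ARP handling and IP packet handling.

Later parts will cover the IP protocol, the routing process (including policy routing), and the neighbour subsystem in depth; together they form one organic whole. Before that we will also look more closely at the TC system, since studying TC helps consolidate an understanding of the packet transmit path. Another important topic is the bridge, which will get a chapter of its own. That chapter will not go very deep and will only outline how bridging works. I once took part in a project that combined bridging and VLANs; for details see LISA (http://lisa.mindbit.ro/).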