Written by Eric Multanen firstname.lastname@example.org
Data Center Bridging (DCB) support on Linux has been emerging since about 2007. This document provides an overview of DCB technologies and describes how they have been implemented in Linux. Examples of how to configure and utilize DCB are provided.
The goal of DCB is to enhance Ethernet with a set of features which will allow the operation of multiple traffic types with differing requirements over the same link. For example, LAN traffic can operate with 'best effort' service and can tolerate occasional dropped packets. On the other hand, Storage traffic (e.g. Fibre Channel over Ethernet) is intolerant of dropped packets. Therefore, in order to transmit and receive both LAN and Storage traffic over the same Ethernet link, it must be possible to simultaneously provide 'best effort' and 'no drop' service. Since Ethernet is traditionally lossy, additional features are needed. These new features are provided by DCB.
DCB achieves its goal of partitioning traffic with differing requirements by providing different capabilities for each packet priority. Priority is signaled in a packet via the Priority Code Point (PCP) value in the VLAN tag. There are eight priority values (0-7). This use of priority to partition the link also implies that traffic must use tagged VLANs in order to take advantage of the differentiation that DCB provides.
The transmit path of a network port is modeled as a set of queues called traffic classes which are numbered 0 through N-1, where N is in the range 1 to 8. The user priorities 0-7 are mapped to the set of traffic classes. Further details and definition of the default priority to traffic class mappings are provided in the IEEE Standard 802.1Q-2011.
A transmission selection algorithm determines which traffic class is chosen next to dequeue a frame and transmit it to the LAN. The default transmission selection algorithm is the Strict Priority algorithm. This algorithm always selects the highest numbered traffic class that has frames to transmit; a lower numbered traffic class is selected only when all higher numbered traffic classes are empty.
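The Strict Priority rule described above can be sketched in a few lines of Python. This is only an illustrative model, not driver or hardware code; the queue contents and the function name strict_priority_select are made up for the example.

```python
from collections import deque


def strict_priority_select(queues):
    """Return the index of the highest numbered non-empty traffic class, or None."""
    for tc in range(len(queues) - 1, -1, -1):
        if queues[tc]:
            return tc
    return None


# Hypothetical port with three traffic classes; frame names are illustrative.
queues = [deque(["lan-1", "lan-2"]), deque(), deque(["mgmt-1"])]

order = []
while (tc := strict_priority_select(queues)) is not None:
    # Dequeue one frame from the selected traffic class and "transmit" it.
    order.append((tc, queues[tc].popleft()))
```

In this model traffic class 0 is only served once traffic class 2 has drained, which is the starvation risk the next paragraph addresses.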
Since the Strict Priority algorithm could allow a traffic flow on a higher numbered traffic class to block a lower numbered traffic class from getting a chance to transmit, another traffic selection algorithm has been defined for DCB called the Enhanced Transmission Selection (ETS) algorithm. ETS works by assigning a percentage of available bandwidth to traffic classes. Available bandwidth is defined as the amount of bandwidth left after higher priority transmission algorithms (like Strict Priority) have executed. The bandwidth percentage allocated to an ETS traffic class is the guaranteed amount of available bandwidth which will be made available to that traffic class. If an ETS traffic class does not use all of the bandwidth allocated to it, then other ETS traffic classes may be able to exceed their bandwidth allocations.
ETS allows multiple traffic flows operating on different traffic classes to each receive their fair share of network bandwidth. Obviously, if the strict priority algorithm is used in combination with the ETS algorithm, then care should be taken to ensure that the traffic flows on the strict priority traffic classes are relatively low volume flows.
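As a rough illustration of how ETS percentages behave, the sketch below computes a steady-state split of the available bandwidth. It is a simplified model under stated assumptions: real implementations use a frame scheduler rather than this calculation, unused bandwidth is assumed to be redistributed in proportion to the configured percentages, and the function name ets_allocate and the numbers are hypothetical.

```python
def ets_allocate(available, percent, demand):
    """Split `available` bandwidth among ETS traffic classes.

    percent: dict of traffic class -> configured ETS percentage.
    demand:  dict of traffic class -> offered load, same unit as `available`.
    Returns a dict of traffic class -> granted bandwidth.
    """
    grant = {tc: 0.0 for tc in percent}
    active = {tc for tc in percent if demand[tc] > 0}
    remaining = float(available)
    while active and remaining > 1e-9:
        weight = sum(percent[tc] for tc in active)
        if weight == 0:
            break
        satisfied = set()
        spent = 0.0
        for tc in active:
            # Each active class is offered its weighted share of what is left.
            share = remaining * percent[tc] / weight
            give = min(share, demand[tc] - grant[tc])
            grant[tc] += give
            spent += give
            if grant[tc] >= demand[tc] - 1e-9:
                satisfied.add(tc)
        remaining -= spent
        if not satisfied:
            break  # every active class consumed its full share
        active -= satisfied  # leftover bandwidth goes to the still-hungry classes
    return grant


# Hypothetical 10 Gb/s of available bandwidth split 50/30/20.
shares = ets_allocate(10.0, {0: 50, 1: 30, 2: 20}, {0: 2.0, 1: 10.0, 2: 10.0})
# Class 0 only needs 2, so classes 1 and 2 exceed their guarantees:
# roughly {0: 2.0, 1: 4.8, 2: 3.2}
```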
Further details about ETS can be found in IEEE Standard 802.1Qaz-2011 - which is an amendment to IEEE Standard 802.1Q-2011.
In order to avoid dropping packets for certain traffic flows, a mechanism is needed for a receiver to tell its link partner to stop transmitting before the receive buffers overflow and packets get dropped. Priority Flow Control (PFC) is a refinement of the pre-existing Ethernet flow control (or PAUSE) feature (see IEEE 802.3-2008 Annex 31B). Link based PAUSE operates by pausing all traffic on a link and can result in congestion spreading throughout the network - impacting many flows which are not the cause of the congestion. PFC helps mitigate these issues by allowing traffic flows on each user priority to be paused independently or not at all. Thus, issues like congestion spreading are constrained to the traffic flows on a given user priority.
For example, traffic flows intolerant of dropped packets (such as FCoE) can be set up to use a priority for which PFC is enabled. Other LAN traffic can use the other priorities which may or may not have PFC enabled. In this way, if the PFC enabled priority needs to be paused, traffic on the other priorities can continue to flow without impediment.
PFC is defined per priority. As described above in the ETS section, priorities are mapped to traffic classes. So, when a port has fewer than eight traffic classes available, care should be taken when the port is configured to ensure that priorities with PFC enabled and priorities with PFC disabled are not mapped to the same traffic class. Implementations may pause all traffic from a given traffic class, so a pause message for a given priority may result in every priority mapped to the same traffic class being paused. If the number of traffic classes is limited and multiple priorities are enabled for PFC, it may be necessary to map all of the PFC enabled priorities to the same traffic class.
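A configuration check for the rule above might look like the following sketch. The function name and the example mappings are hypothetical; the only assumption taken from the text is that a traffic class should not carry both PFC enabled and PFC disabled priorities.

```python
def mixed_pfc_traffic_classes(prio_to_tc, pfc_enabled):
    """Return traffic classes that carry both PFC enabled and PFC disabled priorities.

    prio_to_tc:  sequence where index = user priority, value = traffic class.
    pfc_enabled: set of user priorities with PFC enabled.
    """
    states = {}
    for prio, tc in enumerate(prio_to_tc):
        states.setdefault(tc, set()).add(prio in pfc_enabled)
    return sorted(tc for tc, seen in states.items() if len(seen) == 2)


# Hypothetical port with three traffic classes; PFC is enabled on priority 3 only.
bad_map = [0, 0, 0, 1, 1, 2, 2, 2]   # priority 4 shares traffic class 1 with priority 3
good_map = [0, 0, 0, 1, 2, 2, 2, 2]  # priority 3 has traffic class 1 to itself

conflicts_bad = mixed_pfc_traffic_classes(bad_map, {3})
conflicts_good = mixed_pfc_traffic_classes(good_map, {3})
```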
Further details about PFC can be found in IEEE Standard 802.1Qbb-2011, an amendment to IEEE Standard 802.1Q-2011, and in IEEE Standard 802.3bd-2011.
The purpose of DCB is to configure ETS and PFC in such a way that traffic flows move through the network on priorities whose properties are appropriate for the type of traffic. DCB provides an additional feature, Application Priority, which associates traffic types with specific user priorities. Traffic types or applications are identified by Ethertype or by well-known port number (e.g. for TCP or UDP). Additionally, a default priority can be specified by using an Ethertype value of zero.
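Conceptually, the application priority feature is a lookup table keyed by how the traffic is identified. The sketch below models that table in Python; the priority values are invented for illustration, and the key names ("ethertype", "tcp") are a hypothetical encoding rather than the format used by DCBX or dcbnl.

```python
# Hypothetical application priority table: (identifier kind, protocol id) -> user priority.
APP_PRIORITY_TABLE = {
    ("ethertype", 0x8906): 3,  # FCoE, identified by Ethertype
    ("tcp", 3260): 4,          # iSCSI, identified by well-known TCP port
    ("ethertype", 0): 0,       # Ethertype zero expresses the default priority
}


def lookup_app_priority(kind, protocol_id, table=APP_PRIORITY_TABLE):
    """Return the user priority for a traffic type, falling back to the default entry."""
    if (kind, protocol_id) in table:
        return table[(kind, protocol_id)]
    return table.get(("ethertype", 0))


fcoe_prio = lookup_app_priority("ethertype", 0x8906)
http_prio = lookup_app_priority("tcp", 80)  # no entry, so the default applies
```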
Further details about the application priority feature of DCB can be found in IEEE Standard 802.1Qaz-2011 - which is an amendment to IEEE Standard 802.1Q-2011.
The DCB features ETS, PFC and Application Priority should be configured consistently in order for traffic flows to experience the desired service. The DCBX protocol is designed to help provide link level consistency - typically between a switch and the end-station.
DCBX operates by exchanging the local configuration of each DCB feature with the link partner using the Link Layer Discovery Protocol (LLDP). Information for each feature is packed into a Tag Length Value (TLV) structure and included in the LLDP packet. DCBX state machines then operate on the local and received remote configurations to determine a configuration that will be consistent.
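To make the TLV packing concrete, here is a minimal sketch of how an organizationally specific LLDP TLV is laid out: a 7-bit type (127) and 9-bit length, followed by a 3-byte OUI, a 1-byte subtype and the feature data. The IEEE 802.1 OUI is 00-80-C2. The subtype and payload bytes used in the example are placeholders for illustration only; the actual DCBX subtype values and field layouts should be taken from IEEE 802.1Qaz.

```python
import struct

LLDP_TLV_TYPE_ORG_SPECIFIC = 127
IEEE_8021_OUI = bytes([0x00, 0x80, 0xC2])


def pack_org_specific_tlv(oui, subtype, info):
    """Pack an organizationally specific LLDP TLV (header, OUI, subtype, info)."""
    if len(oui) != 3:
        raise ValueError("OUI must be exactly 3 bytes")
    length = len(oui) + 1 + len(info)  # the length field excludes the 2-byte header
    if length > 0x1FF:
        raise ValueError("TLV information string too long for a 9-bit length")
    header = (LLDP_TLV_TYPE_ORG_SPECIFIC << 9) | length
    return struct.pack("!H", header) + oui + bytes([subtype]) + info


# Placeholder subtype and payload, not a real DCBX feature encoding.
tlv = pack_org_specific_tlv(IEEE_8021_OUI, 0x0B, bytes([0x00, 0x08]))
```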
The typical usage of DCBX is to configure the switch to provide the desired DCB configuration and configure the station to be willing to adopt the configuration provided by the switch.
Further details about DCBX can be found in IEEE Standard 802.1Qaz-2011 - which is an amendment to IEEE Standard 802.1Q-2011.
Further details about LLDP can be found in IEEE Standard 802.1AB-2009.
Before the DCB and DCBX features were standardized by IEEE 802.1, a couple of pre-standard versions existed. The initial version of DCBX was created by Intel and Cisco and the document can be found here: http://www.intel.com/technology/eedc/index.htm
This initial version was then modified by a consortium of companies and resulted in the CEE version of DCB/DCBX. The resulting specifications were used as the baseline for the standardization work in IEEE 802.1. Many of the current DCB/DCBX implementations support CEE. The CEE specifications can be found here: http://www.ieee802.org/1/files/public/docs2008/dcb-baseline-contributions-1108-v1.01.pdf
The DCB implementation in Linux is comprised of the following components:
The key to operation of DCB in Linux is the skb_priority field in the skb structure. When DCB is in operation on a network interface, the skb_priority is mapped to the user priority value which becomes the PCP in the VLAN tag.
When an skb reaches netdev for a multi-traffic class interface, the skb_priority field is used to select which traffic class the skb is destined to based on the user priority to traffic class mapping which was provided to netdev by the driver. After the traffic class has been selected, a queue within the traffic class is selected. This can be performed by the driver's ndo_select_queue() routine, or it may be done by a hashing function in netdev.
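The two-stage selection described above (traffic class first, then a queue within the class) can be modeled as below. This is a simplified sketch, not the kernel's code: the mapping tables, the queue ranges and the use of a plain modulo on a flow hash are assumptions made for the example.

```python
def select_tx_queue(skb_priority, flow_hash, prio_tc_map, tc_queues):
    """Pick a traffic class from the priority, then a queue inside that class.

    prio_tc_map: sequence where index = user priority, value = traffic class.
    tc_queues:   sequence of (first queue index, queue count) per traffic class.
    """
    tc = prio_tc_map[skb_priority]
    offset, count = tc_queues[tc]
    return tc, offset + flow_hash % count


# Hypothetical device: 3 traffic classes spread over 8 transmit queues.
prio_tc_map = [0, 0, 0, 1, 1, 2, 2, 2]
tc_queues = [(0, 4), (4, 2), (6, 2)]

tc, queue = select_tx_queue(3, 7, prio_tc_map, tc_queues)
```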
When the skb reaches the transmit routine of the driver, the skb_priority value is used to set the PCP field of the VLAN tag. The skb will have arrived at the driver on a queue which is associated with the correct traffic class upon which it will be transmitted.
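For reference, the PCP occupies the top three bits of the 16-bit Tag Control Information (TCI) field of the VLAN tag, followed by one DEI bit and the 12-bit VLAN ID. The sketch below builds a TCI value; it assumes the skb_priority has already been mapped one-to-one to a user priority in the range 0-7, and the function name is illustrative.

```python
def vlan_tci(pcp, vid, dei=0):
    """Build the 16-bit VLAN TCI: PCP (3 bits), DEI (1 bit), VLAN ID (12 bits)."""
    if not 0 <= pcp <= 7:
        raise ValueError("PCP must be in the range 0-7")
    if not 0 <= vid <= 0xFFF:
        raise ValueError("VLAN ID must fit in 12 bits")
    if dei not in (0, 1):
        raise ValueError("DEI must be 0 or 1")
    return (pcp << 13) | (dei << 12) | vid


tci = vlan_tci(pcp=3, vid=100)  # user priority 3 on VLAN 100
```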
In order to get any specific traffic flow to transmit on a given user priority it is required to set the skb_priority field accordingly. There exist a variety of methods for setting the skb_priority field.
Each method will now be described in greater detail.
An application which opens a socket to send and receive network traffic can use the SO_PRIORITY socket option to set priority for the socket. This will become the skb_priority field in the skb. The application can first query the application priority settings via dcbnl and use the resulting priority, if found, in the setsockopt() call.
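A minimal sketch of the socket option in use is shown below. The priority value 3 is a stand-in for whatever the dcbnl application priority query would return (that query is not shown here), and the fallback constant 12 is an assumption covering Python builds that do not export SO_PRIORITY; it is the Linux value on common architectures.

```python
import socket

# Linux socket option; fall back to the usual Linux value if Python does not export it.
SO_PRIORITY = getattr(socket, "SO_PRIORITY", 12)


def open_socket_with_priority(priority):
    """Create a TCP socket whose outgoing skbs carry the given priority."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, SO_PRIORITY, priority)
    return sock


# Hypothetical priority, e.g. the one learned for the application via dcbnl.
sock = open_socket_with_priority(3)
current = sock.getsockopt(socket.SOL_SOCKET, SO_PRIORITY)
sock.close()
```

On Linux, priorities 0 through 6 can be set by an unprivileged process; higher values require the CAP_NET_ADMIN capability.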
There are a couple drawbacks to this approach:
At this time, the open-iscsi package uses this SO_PRIORITY mechanism for setting the skb_priority in a DCB environment.
tbd - tc filters for setting the skb_priority currently do not operate properly with the mqprio qdisc (the qdisc which supports the multi-traffic class, multi-queue model). The tc filter mechanism does provide an action which can set the skb_priority, but this action is currently executed after the skb_priority has been used to select a traffic class, so it takes effect too late.
The net_prio cgroup mechanism provides a method for classifying and setting the priority of application packets which solves the issues called out in the section on using SO_PRIORITY. Application sockets are associated with a net_prio cgroup instead of the actual user priority. When the skb arrives at netdev and selection of the traffic class needs to be done, a lookup is performed to find the user priority that is associated with the destination network interface and net_prio cgroup.
This approach solves two problems:
The first step for using the net_prio cgroup is to ensure that the kernel is configured to support net_prio cgroup. This is controlled by the kernel CONFIG_NETPRIO_CGROUP setting.
Once a kernel with net_prio cgroup support is running, the following steps will set up the net_prio cgroup subsystem for use:
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/net_prio
mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio
Once this is completed the default net_prio cgroup is available in /sys/fs/cgroup/net_prio
Key files in this (or any net_prio) directory are:
eth1 0
eth2 0
etc.
Just before skb's are assigned to a specific traffic class by netdev, the net_prio cgroup is looked up using the value of prioidx saved in the socket and then the prioidx map for the specific net_prio cgroup is used to identify the user priority associated with the interface. This value is then saved in the skb_priority IF the skb_priority has not already been set to a non-zero value by some other means.
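The precedence rule in the paragraph above (the cgroup priority is applied only when skb_priority is still zero) can be expressed as a small model. The function and the example map are hypothetical; the real lookup happens inside the kernel.

```python
def effective_skb_priority(skb_priority, cgroup_ifpriomap, ifname):
    """Model of the net_prio lookup: apply the cgroup priority only if none is set yet.

    cgroup_ifpriomap: dict of interface name -> priority for the socket's net_prio cgroup.
    """
    if skb_priority != 0:
        return skb_priority  # already set by other means, e.g. SO_PRIORITY
    return cgroup_ifpriomap.get(ifname, 0)


# Hypothetical map for the cgroup the sending process belongs to.
ifpriomap = {"eth1": 3, "eth2": 0}

from_cgroup = effective_skb_priority(0, ifpriomap, "eth1")
from_socket = effective_skb_priority(5, ifpriomap, "eth1")
```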
The net_prio.ifpriomap mapping can be managed by echoing in a new mapping:
echo "eth1 3" > net_prio.ifpriomap
Note: In the case of a VLAN interface the full name needs to be used:
echo "eth1.vlan_id 3" > net_prio.ifpriomap
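The same update can be scripted. The sketch below assumes the one "<interface> <priority>" pair per line format shown earlier for net_prio.ifpriomap; the helper names are made up, and the directory passed in would normally be a mounted net_prio cgroup directory, which requires root privileges to write.

```python
import os


def parse_ifpriomap(text):
    """Parse the contents of net_prio.ifpriomap into a dict of interface -> priority."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        ifname, priority = line.split()
        mapping[ifname] = int(priority)
    return mapping


def set_ifpriomap(cgroup_dir, ifname, priority):
    """Write one "<interface> <priority>" entry, like the echo command does."""
    path = os.path.join(cgroup_dir, "net_prio.ifpriomap")
    with open(path, "w") as f:
        f.write(f"{ifname} {priority}")


parsed = parse_ifpriomap("eth1 0\neth2 0\n")
# Example call (needs root and a mounted net_prio cgroup):
# set_ifpriomap("/sys/fs/cgroup/net_prio", "eth1", 3)
```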
Processes can be assigned to a different net_prio cgroup by adding the PID to that cgroup's tasks file:
echo <PID> > tasks
The PID is automatically removed from the old net_prio cgroup tasks file.
Another method for starting a process in a net_prio cgroup is to use the cgexec command:
cgexec -g [<controllers>:<path>] command [arguments]
cgexec -g net_prio:my_net_prio_group ping google.com
Install the libcgroup package to get cgexec and other cgroup management programs.
New net_prio cgroups can be made from the default net_prio cgroup:
cd /sys/fs/cgroup/net_prio
mkdir my_net_prio_group
The cgdcbxd program provides a way to automate the creation of net_prio cgroups using the application priority information maintained by dcbnl. cgdcbxd is currently located at:
Before running cgdcbxd, the following two steps should be performed:
When cgdcbxd executes, it will read the application priority list maintained by dcbnl and create corresponding net_prio cgroups. It will continue to monitor dcbnl and update the net_prio cgroups whenever changes are detected from dcbnl. The cgroups created by cgdcbxd use the following naming convention:
The selector values are the same as the DCBX application priority selector values, which are:
1 - Ethertype
2 - Well known port number over TCP or SCTP
3 - Well known port number over UDP or DCCP
4 - Well known port number over TCP, SCTP, UDP or DCCP
The protocol value is either a 4 digit hexadecimal value if the selector is Ethertype, or a decimal number for the well known port selectors. For example, cgdcbxd could create the following net_prio cgroups:
cgdcb-1-8906 indicates FCoE as identified by Ethertype 0x8906
cgdcb-4-3260 indicates iSCSI as identified by well known port 3260
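Scripts that need to interpret these cgroup names could decode them as in the following sketch. The cgdcb-<selector>-<protocol> pattern is inferred from the two examples above, and the function name is hypothetical.

```python
SELECTOR_NAMES = {
    1: "Ethertype",
    2: "Well known port number over TCP or SCTP",
    3: "Well known port number over UDP or DCCP",
    4: "Well known port number over TCP, SCTP, UDP or DCCP",
}


def parse_cgdcb_name(name):
    """Split a cgdcbxd cgroup name into (selector, protocol id)."""
    prefix, selector_text, protocol_text = name.split("-")
    if prefix != "cgdcb":
        raise ValueError(f"not a cgdcbxd cgroup name: {name!r}")
    selector = int(selector_text)
    if selector not in SELECTOR_NAMES:
        raise ValueError(f"unknown selector: {selector}")
    # Ethertypes are written in hexadecimal, port numbers in decimal.
    base = 16 if selector == 1 else 10
    return selector, int(protocol_text, base)


fcoe = parse_cgdcb_name("cgdcb-1-8906")
iscsi = parse_cgdcb_name("cgdcb-4-3260")
```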
The cgroups created by cgdcbxd will be removed when cgdcbxd is stopped.
Use of cgdcbxd together with lldpad (DCBX) largely automates the DCBX application priority feature. DCBX will receive application priority TLVs from the switch and update the dcbnl list. cgdcbxd then automatically creates corresponding cgroups. The only other steps which need to be performed are:
(see “Using cgrulesengd” below)
One method of configuring the association of applications to net_prio cgroups is to use the cgrulesengd program (also part of the libcgroup package). This is a general purpose cgroup configuration program which uses a configuration file (/etc/cgrules.conf) to configure cgroup associations. Lines in cgrules.conf have the following format:
<user>:<process name> <controllers> <destination>
A simple rule to add iscsid to the iSCSI cgroup as created by cgdcbxd would be:
*:iscsid net_prio cgdcb-4-3260
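To illustrate how the three fields of such a rule break down, the sketch below splits a rule line into its parts. It is not how cgrulesengd parses its configuration; it only handles the simple format shown above, and treating the controllers field as a comma-separated list is an assumption about that format.

```python
def parse_cgrule(line):
    """Split a cgrules.conf rule of the form '<user>:<process name> <controllers> <destination>'."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    subject, controllers, destination = fields
    user, _, process = subject.partition(":")
    return {
        "user": user,
        "process": process or None,  # the process name is optional in the subject
        "controllers": controllers.split(","),
        "destination": destination,
    }


rule = parse_cgrule("*:iscsid net_prio cgdcb-4-3260")
```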
Take a look at the comments provided in the sample cgrules.conf file provided with the cgrulesengd program for more details on how to configure cgroups.
A special note on the VLAN egress map
VLANs on Linux have a feature which allows configuration of the priority egress mapping - which is a mapping of skb_priority to user priority. This mapping is configurable via VLAN interface configuration utilities. By default, all skb_priorities map to user priority zero. The usage of the skb_priority, as described in this document, implies that the skb_priority to user priority mapping for values of skb_priority 0..7 is one to one. Some further thought and work needs to go into the issue of how to integrate the usage of the skb_priority to user priority mapping of the VLAN interface with the usage of skb_priority by the features supporting DCB.
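One way to obtain the one-to-one mapping for skb_priority values 0..7 is to pass an explicit egress map when the VLAN interface is created, using the egress-qos-map option of the ip link command. The sketch below only builds the argument list; the interface name and VLAN ID are examples, and whether this is the right integration point for a given setup is left open, as noted above.

```python
def vlan_add_command(parent, vid, priorities=range(8)):
    """Build an 'ip link add' command creating a VLAN with a one-to-one egress map."""
    egress_map = [f"{p}:{p}" for p in priorities]  # skb_priority:user priority pairs
    return [
        "ip", "link", "add", "link", parent,
        "name", f"{parent}.{vid}",
        "type", "vlan", "id", str(vid),
        "egress-qos-map", *egress_map,
    ]


cmd = vlan_add_command("eth1", 100)
command_line = " ".join(cmd)
# The list could be run with subprocess.run(cmd, check=True) by a privileged user.
```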