
Toward the Accurate Identification of Network Applications

Andrew W. Moore (University of Cambridge, andrew.moore@cl.cam.ac.uk)
Konstantina Papagiannaki (Intel Research, Cambridge, dina.papagiannaki@intel.com)

Abstract. Well-known port numbers can no longer be used to reliably identify network applications. There is a variety of new Internet applications that either do not use well-known port numbers or use other protocols, such as HTTP, as wrappers in order to go through firewalls without being blocked. One consequence of this is that a simple inspection of the port numbers used by flows may lead to the inaccurate classification of network traffic. In this work, we look at these inaccuracies in detail. Using a full payload packet trace collected from an Internet site we attempt to identify the types of errors that may result from port-based classification and quantify them for the specific trace under study. To address this question we devise a classification methodology that relies on the full packet payload. We describe the building blocks of this methodology and elaborate on the complications that arise in that context. A classification technique approaching 100% accuracy proves to be a labour-intensive process that needs to test flow characteristics against multiple classification criteria in order to gain sufficient confidence in the nature of the causal application. Nevertheless, the benefits gained from a content-based classification approach are evident. We are capable of accurately classifying what would otherwise be classified as unknown, as well as identifying traffic flows that could otherwise be classified incorrectly. Our work opens up multiple research issues that we intend to address in future work.



1 Introduction

Network traffic monitoring has attracted a lot of interest in the recent past. One of the main operations performed within such a context is the identification of the different applications utilising a network's resources. Such information proves invaluable for network administrators and network designers. Only knowledge about the traffic mix carried by an IP network can allow efficient design and provisioning. Network operators can identify the requirements of different users from the underlying infrastructure and provision appropriately. In addition, they can track the growth of different user populations and design the network to accommodate the diverse needs. Lastly, accurate identification of network applications can shed light on emerging applications as well as possible mis-use of network resources.

The state of the art in the identification of network applications through traffic monitoring relies on the use of well-known ports: an analysis of the headers of packets is used to identify traffic associated with a particular port and thus with a particular application [1–3]. It is well known that such a process is likely to lead to inaccurate estimates of the amount of traffic carried by different applications, given that specific protocols, such as HTTP, are frequently used to relay other types of traffic, e.g., the NeoTeris VLAN over HTTP product. In addition, emerging services typically avoid the use of well-known ports, e.g., some peer-to-peer applications.

This paper describes a method to address the accurate identification of network applications in the presence of packet payload information³. We illustrate the benefits of our method by comparing a characterisation of the same period of network traffic using ports alone and our content-based method. This comparison allows us to highlight how differences between port- and content-based classification may arise. Having established the benefits of the proposed methodology, we proceed to evaluate the requirements of our scheme in terms of complexity and the amount of data that needs to be accessed. We demonstrate the trade-offs that need to be addressed between the complexity of the different classification mechanisms employed by our technique and the resulting classification accuracy. The presented methodology is not automated and may require human intervention. Consequently, in future work we intend to study its requirements in terms of a real-time implementation.

The remainder of the paper is structured as follows. In Section 2 we present the data used throughout this work. In Section 3 we describe our content-based classification technique. Its application is shown in Section 4, where the obtained results are contrasted against the outcome of a port-based classification scheme. In Section 5 we describe our future work.

Andrew Moore thanks the Intel Corporation for its generous support of his research fellowship.
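For concreteness, the port-based state of the art that we compare against amounts to a table lookup on a flow's port numbers. The sketch below is ours, not the paper's; the port/category pairs are an illustrative subset rather than the full IANA registry:

```python
# Minimal sketch of port-based classification: map either port of a
# flow to an application label. The port map below is illustrative.
WELL_KNOWN_PORTS = {
    20: "FTP-DATA", 21: "FTP", 25: "SMTP", 53: "DNS",
    80: "WWW", 110: "POP3", 143: "IMAP", 443: "WWW",
}

def classify_by_port(src_port: int, dst_port: int) -> str:
    """Return the application suggested by either port, else UNKNOWN."""
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "UNKNOWN"

print(classify_by_port(34512, 80))    # an ordinary web request
print(classify_by_port(34513, 6881))  # a peer-to-peer port: missed entirely
```

As the paper argues, this lookup misattributes traffic relayed over well-known ports and leaves anything on unregistered ports as UNKNOWN.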


2 Collected Data

This work presents an application-level approach to characterising network traffic. We illustrate the benefits of our technique using data collected by the high-performance network monitor described in [5]. The site we examined hosts several Biology-related facilities, collectively referred to as a Genome Campus. There are three institutions on-site that employ about 1,000 researchers, administrators and technical staff. This campus is connected to the Internet via a full-duplex Gigabit Ethernet link. It was on this connection to the Internet that our monitor was placed. Traffic was monitored for a full 24-hour, week-day period and for both link directions.

³ Packet payload for the identification of network applications is also used in [4]. Nonetheless, no specific details are provided in [4] on the implementation of the system, thus making comparison infeasible. No further literature was found by the authors regarding that work.

Table 1. Summary of traffic analysed.

         Total Packets   Total MBytes
Total    573,429,697     268,543
         (as percentage of total)
TCP      94.819          98.596
ICMP     3.588           0.710
UDP      1.516           0.617
OTHER    0.077           0.077

Brief statistics on the traffic data collected are given in Table 1. Other protocols were observed in the trace, namely IPv6-crypt, PIM, GRE, IGMP, NARP and private encryption, but the largest of them accounted for fewer than one million packets (less than 0.06%) over the 24-hour period, and the total of all OTHER protocols was fewer than one and a half million packets. All percentage values given henceforth are from the total of UDP and TCP packets only.


3 Overview of Content-Based Classification

Our content-based classification scheme can be viewed as an iterative procedure whose target is to gain sufficient confidence that a particular traffic stream is caused by a specific application. To achieve such a goal our classification method operates on traffic flows and not packets. Grouping packets into flows allows for more efficient processing of the collected information, as well as the acquisition of the necessary context for an appropriate identification of the network application responsible for a flow. Obviously, the first step we need to take is that of aggregating packets into flows according to their 5-tuple. In the case of TCP, additional semantics can also allow for the identification of the start and end time of the flow. The fact that we observe traffic in both directions allows classification of nearly all flows on the link. A traffic monitor on a unidirectional link can identify only those applications that use the monitored link for their datapath.

One outcome of this operation is the identification of unusual or peculiar flows, specifically simplex flows. These flows consist of packets exchanged between a particular port/protocol combination in only one direction between two hosts. A common cause of a simplex flow is that packets have been sent to an invalid or non-responsive destination host. The data of the simplex flows were not discarded; they were classified, and commonly identified as carrying worm and virus attacks. The identification and removal of simplex flows (each flow consisting of between three and ten packets sent over a 24-hour period) significantly reduced the number of unidentified flows that needed further processing.
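The aggregation and simplex-flow steps above can be sketched as follows. The packet representation, a bare (src, dst, sport, dport, proto) tuple, is our own simplification, not the paper's data format:

```python
from collections import defaultdict

# Sketch of flow aggregation: group packets by 5-tuple, then flag
# simplex flows, i.e. flows whose reverse 5-tuple is never observed.
def aggregate(packets):
    flows = defaultdict(int)  # 5-tuple -> packet count
    for pkt in packets:
        flows[pkt] += 1
    return flows

def simplex_flows(flows):
    """Flows with no traffic in the reverse direction."""
    return [t for t in flows
            if (t[1], t[0], t[3], t[2], t[4]) not in flows]

pkts = [("a", "b", 1234, 80, "tcp"),   # forward direction
        ("b", "a", 80, 1234, "tcp"),   # reply: duplex flow
        ("a", "c", 1234, 135, "tcp")]  # no reply: simplex (e.g. a worm probe)
flows = aggregate(pkts)
print(simplex_flows(flows))  # [('a', 'c', 1234, 135, 'tcp')]
```

Because the check needs both link directions, this culling step is exactly what is lost when only unidirectional traffic is available.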

The second step of our method iteratively tests flow characteristics against different criteria until sufficient certainty has been gained as to the identity of the application. Such a process consists of nine different identification sub-methods. We describe these mechanisms in the next section. Each identification sub-method is followed by the evaluation of the acquired certainty in the candidate application. Currently this is a (labour-intensive) manual process.

3.2 Identification Methods

The nine distinct identification methods applied by our scheme are listed in Table 2. Alongside each method is an example application that we could identify using this method. Each one tests a particular property of the flow, attempting to obtain evidence of the identity of the causal application.
Table 2. Methods of flow identification.

     Identification Method              Example
I    Port-based classification (only)  —
II   Packet header (including I)       simplex flows
III  Single-packet signature           many worm/virus
IV   Single-packet protocol            IDENT
V    Signature on the first KByte      P2P
VI   First-KByte protocol              SMTP
VII  Selected flow(s) protocol         FTP
VIII (All) flow protocol               VNC, CVS
IX   Host history                      port-scanning


Method I classifies flows according to their port numbers. This method represents the state of the art and requires access only to the part of the packet header that contains the port numbers. Method II relies on access to the entire packet header for both traffic directions. It is this method that is able to identify simplex flows and significantly limit the number of flows that need to go through the remainder of the classification process. Methods III to VIII examine whether a flow carries a well-known signature or follows well-known protocol semantics. Such operations are accompanied by higher complexity and may require access to more than a single packet's payload. We have listed the different identification mechanisms in terms of their complexity and the amount of data they require in Figure 1. According to our experience, specific flows may be classified positively from their first packet alone. Nonetheless, other flows may need to be examined in more detail, and a positive identification may be feasible only once up to 1 KByte of their data has been observed⁴. Flows that have not been classified at this stage will require inspection of the entire flow payload, and we separate such a process into two distinct steps. In the first step (Method VII) we perform full-flow analysis for a subset of the flows that perform a control function. In our case FTP appeared to carry a significant amount of the overall traffic, and Method VII was applied only to those flows that used the standard FTP control port. The control messages were parsed and further context was obtained that allowed us to classify more flows in the trace. Lastly, if there are still flows to be classified, we analyse them using specific protocol information, attributing them to their causal application using Method VIII.

Fig. 1. Requirements of identification methods: complexity plotted against the amount of data each method requires, ranging from a single packet, through the first KByte, to selected flows and entire flows.

In our classification technique we apply each identification method in turn, such that the more complex or more data-demanding methods (as shown in Figure 1) are used only if no previous signature or protocol method has generated a match. The outcome of this process may be that (i) we have positively identified a flow as belonging to a specific application, (ii) a flow appears to agree with more than one application profile, or (iii) no candidate application has been identified. In our current methodology all three cases trigger manual intervention in order to validate the accuracy of the classification, resolve cases where multiple criteria have generated a match, or inspect flows that have not matched any identification criteria. We describe our validation approach in more detail in Section 3.4.

⁴ The value of 1 KByte has been experimentally found to be an upper bound for the amount of packet information that needs to be processed for the identification of several applications making use of signatures. In future work, we intend to address the exact question of what is the necessary amount of payload one needs to capture in order to identify different types of applications.
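A hedged sketch of this cheapest-first ordering follows; the port map and regular-expression signatures are illustrative stand-ins for the real Method I–VIII rules, not the paper's actual criteria:

```python
import re

# Illustrative signatures (snort-style string matching); each maps a
# byte pattern to an application label.
SIGNATURES = [(re.compile(rb"^(GET|POST|HTTP/)"), "WWW"),
              (re.compile(rb"^220 .*FTP", re.I), "FTP")]

def by_port(flow):
    return {80: "WWW", 21: "FTP"}.get(flow["dst_port"])

def by_first_packet_signature(flow):
    for pattern, app in SIGNATURES:
        if pattern.search(flow["payload"][:64]):  # first packet only
            return app
    return None

def by_first_kbyte_signature(flow):
    for pattern, app in SIGNATURES:
        if pattern.search(flow["payload"][:1024]):  # reassembled KByte
            return app
    return None

# Methods ordered by increasing complexity; stop at the first match.
METHODS = [by_port, by_first_packet_signature, by_first_kbyte_signature]

def classify(flow):
    for method in METHODS:
        app = method(flow)
        if app is not None:
            return app
    return "UNKNOWN"  # left for full-flow analysis or manual inspection

flow = {"dst_port": 8080, "payload": b"GET /index.html HTTP/1.1\r\n"}
print(classify(flow))  # WWW: a web server on a non-standard port
```

The dispatcher structure makes the complexity/accuracy trade-off explicit: each later method costs more per flow but is consulted only for the residue the earlier methods could not identify.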


The successful identification of specific flows caused by a particular network application reveals important information about the hosts active in our trace. Our technique utilises this information to build a knowledge base for particular host/port combinations that can be used to validate future classification by testing conformance with already-observed host roles (Method IX). One outcome of this operation is the identification of hosts performing port scanning, where a particular destination host is contacted from the same source host on many sequential port numbers. These flows evidently do not belong to a particular application (unless port scanning is part of the applications looked into). For a different set of flows, this process validated the streaming audio from a pool of machines serving a local broadcaster. Method IX can be further enhanced to use information from the host name as recorded in the DNS. While we used this as a process of last resort (DNS names can be notoriously un-representative), DNS names in our trace did reveal the presence of an HTTP proxy, a Mail exchange server and a VPN endpoint operating over a TCP/IP connection.

3.3 Classification Approach

An illustration of the flow through the different identification sub-methods, as employed by our approach, is shown in Figure 2. In the first step we attempt to reduce the number of flows to be further processed by using context obtained through previous iterations. Specific flows in our data can be seen as "child" connections arising from "parent" connections that precede them. One such example is a web browser that initiates multiple connections in order to retrieve parts of a single web page. Having parsed the "parent" connection allows us to immediately identify the "child" connections and classify them to the causal web application.
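A minimal sketch of this bookkeeping (the class and its API are our own invention): parsing a parent flow registers the endpoints its child connections will use, so a later flow matching those endpoints inherits the classification immediately:

```python
# Sketch of parent/child flow tracking: parsing a control ("parent")
# flow registers the endpoints its "child" connections will use, so a
# later flow matching those endpoints is classified without further
# processing. Keying on (dst_host, dst_port) is an illustrative choice.
class ChildRegistry:
    def __init__(self):
        self.expected = {}

    def register(self, dst_host, dst_port, app):
        self.expected[(dst_host, dst_port)] = app

    def lookup(self, dst_host, dst_port):
        # pop: each announced child endpoint is consumed once
        return self.expected.pop((dst_host, dst_port), None)

reg = ChildRegistry()
# Parsing an FTP control stream announced a passive-mode data endpoint:
reg.register("10.0.0.5", 40123, "FTP-DATA")
print(reg.lookup("10.0.0.5", 40123))  # FTP-DATA
print(reg.lookup("10.0.0.5", 40123))  # None (already consumed)
```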


Fig. 2. Classification procedure: a flow already attributable to a previously classified ("parent") flow is labelled immediately; otherwise it is tagged with any well-known port and then tested, in turn, for a well-known signature and protocol in the first packet, for a well-known signature and protocol in the first KByte, and finally for a known protocol over selected or entire flows. Classifications are verified (using Method IX among other mechanisms); failed verifications and unmatched flows are passed to manual intervention.

A second example, which has a predominant effect in our data, is passive FTP. Parsing the "parent" FTP session (Method VIII) allows the identification of the subsequent "child" connection that may be established toward a different host at a non-standard port. Testing whether a flow is the result of an already-classified flow at the beginning of the classification process allows for the fast characterisation of a network flow without the need to go through the remainder of the process.

If the flow is not positively identified in the first stage then it goes through several additional classification criteria. The first mechanism examines whether a flow uses a well-known port number. While port-based classification is prone to error, the port number is still a useful input into the classification process because it may convey useful information about the identity of the flow. If no well-known port is used, the classification proceeds through the next stages. However, even when a flow is found to operate on a well-known port, it is tagged as well-known but still forwarded through the remainder of the classification process.

In the next stage we test whether the flow contains a known signature in its first packet. At this point we will be able to identify flows that may be directed to well-known port numbers but carry non-legitimate traffic, as in the case of virus or attack traffic. Signature scanning is a process that sees common use within Intrusion Detection Systems such as snort [6]. It has the advantage that a suitable scanner is often optimised for string matching while still allowing the expression of flexible matching criteria. By scanning for signatures, applications such as web servers operating on non-standard ports may be identified. If no known signature has been found in the first packet, we check whether the first packet of the flow conveys semantics of a well-known protocol. An example to that effect is IDENT, which is a single-packet IP protocol. If this test fails we look for well-known signatures in the first KByte of the flow, which may require assembly of multiple individual packets. At this stage we will be able to identify peer-to-peer traffic if it uses well-known signatures. Traffic due to SMTP will have been detected from the port-based classification, but only the examination of the protocol semantics within the first KByte of the flow will allow for the confident characterisation of the flow. Network protocol analysis tools, such as ethereal [7], employ a number of such protocol decoders and may be used to make or validate a protocol identification.

Specific flows will still remain unclassified even at this stage and will require inspection of their entire payload. This operation may be manual or automated for particular protocols. From our experience, focusing on the protocol semantics of FTP led to the identification of a very significant fraction of the overall traffic, limiting the unknown traffic to less than 2%. At this point the classification procedure can end. However, if 100% accuracy is to be approached, we envision that the last stage of the classification process may involve the manual inspection of all unidentified flows. This stage is rather important since it is likely to reveal new applications. While labour-intensive, the individual examination of the remaining, unidentified flows caused the creation of a number of new signatures and protocol templates that could then be used for identifying protocols such as PCAnywhere, the sdserver and CVS. This process also served to identify more task-specific systems. An example of this was a host offering protocol-specific database services.

On occasion flows may remain unclassified despite this process; this takes the form of small samples (e.g., 1–2 packets) of data that do not provide enough information to allow any classification process to proceed. These packets used unrecognised ports and rarely carried any payload. While such background noise was not zero, in the context of classification for accounting, Quality-of-Service, or resource planning these amounts could be considered insignificant. The actual amount of data, in terms of either packets or bytes, that remained unclassified represented less than 0.001% of the total.

3.4 Validation Process

Accurate classification is complicated by the unusual uses to which some protocols are put. As noted earlier, the use of one protocol to carry another, such as the use of HTTP to carry peer-to-peer application traffic, will confuse a simple signature-based classification system. Additionally, the use of FTP to carry an HTTP transaction log will similarly confuse signature matching. Due to these unusual cases, establishing the certainty of any classification is a difficult task. Throughout the work presented in this paper, validation was performed manually in order to approach 100% accuracy in our results.

Our validation approach features several distinct methods. Each flow is tested against multiple classification criteria. If this procedure leads to several criteria being satisfied simultaneously, manual intervention can allow for the identification of the true causal application. An example is the peer-to-peer situation. Identifying a flow as HTTP does not suggest anything more than that the flow contains HTTP signatures. After applying all classification methods we may conclude that the flow is HTTP alone, or additional signature matching (e.g., identifying a peer-to-peer application) may indicate that the flow is the result of a peer-to-peer transfer.

If the flow classification results from a well-known protocol, then the validation approach tests the conformance of the flow to the actual protocol. An example of this procedure is the identification of FTP PASV flows. A PASV flow can be valid only if the FTP control stream overlaps the duration of the PASV flow; such cursory, protocol-based examination allows an invalid classification to be identified. Alongside this process, flows can be further validated against the perceived function of a host. For example, an identified router would be valid to relay BGP, whereas for a machine identified as (probably) a desktop Windows box behind a NAT, concluding that it was transferring BGP is unlikely, and this potentially invalid classification requires manual intervention.
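The temporal side of the PASV check can be sketched directly. Representing each flow as a (start, end) timestamp pair is our own convention; the rule itself, that the control stream must overlap the data flow's lifetime, is the one described above:

```python
# Sketch of the PASV validation rule: a data flow classified as passive
# FTP is plausible only if the FTP control connection that announced it
# was alive during (at least part of) the data flow's duration.
def overlaps(control, data):
    """True if the two (start, end) intervals intersect."""
    c_start, c_end = control
    d_start, d_end = data
    return c_start <= d_end and d_start <= c_end

control = (100.0, 500.0)                  # FTP control stream lifetime
print(overlaps(control, (150.0, 300.0)))  # True: valid PASV candidate
print(overlaps(control, (600.0, 700.0)))  # False: control already closed
```

A data flow that fails this check is flagged as a potentially invalid classification and handed to manual intervention.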



4 Results

Given the large number of identified applications, and for ease of presentation, we group applications into types according to their potential requirements from the network infrastructure. Table 3 indicates ten such classes of traffic. Importantly, the characteristics of the traffic within each category are not necessarily unique. For example, the BULK category, which is made up of FTP traffic, consists of both the FTP control channel, with data in both directions, and the FTP data channel, which consists of a simplex flow of data for each object transferred.

Table 3. Network traffic allocated to each category.

Classification   Example Application
BULK             ftp
DATABASE         postgres, sqlnet, oracle, ingres
INTERACTIVE      ssh, klogin, rlogin, telnet
MAIL             imap, pop2/3, smtp
SERVICES         X11, dns, ident, ldap, ntp
WWW              www
P2P              KaZaA, BitTorrent, GnuTella
MALICIOUS        Internet worm and virus attacks
GAMES            Half-Life
MULTIMEDIA       Windows Media Player, Real

In Table 4 we compare the results of simple port-based classification with content-based classification. The technique of port analysis, against which we compare our approach, is common industry practice (e.g., Cisco NetFlow or [1, 2]). UNKNOWN refers to applications which for port-based analysis are not readily identifiable. Notice that under the content-based classification approach we had nearly no UNKNOWN traffic; instead we have five new traffic classes detected. The traffic we were not able to classify corresponds to a small number of flows. A limited number of flows provides a minimal sample of the application behaviour and thus cannot allow for the confident identification of the causal application.

Table 4 shows that under the simple port-based classification scheme based upon the IANA port assignments, 30% of the carried bytes cannot be attributed to a particular application. Further observation reveals that the BULK traffic is underestimated by approximately 20%, while we see a difference of 6% in the WWW traffic. However, the port-based approach does not only underestimate traffic; for some classes, e.g., INTERACTIVE applications, it may over-estimate it. This means that traffic flows can also be misidentified under the port-based technique. Lastly, applications such as peer-to-peer and mal-ware appear to contribute zero traffic in the port-based case. This is because the ports through which such protocols travel do not provide a standard identification. Such port-based estimation errors are believed to be significant.

Table 4. Contrasting port-based and content-based classification.

                 Port-Based        Content-Based
Classification   Packets  Bytes    Packets  Bytes
                 (as a percentage of total traffic)
BULK             46.97    45.00    65.06    64.54
DATABASE         0.03     0.03     0.84     0.76
GRID             0.03     0.07     0.00     0.00
INTERACTIVE      1.19     0.43     0.75     0.39
MAIL             3.37     3.62     3.37     3.62
SERVICES         0.07     0.02     0.29     0.28
WWW              19.98    20.40    26.49    27.30
UNKNOWN          28.36    30.43    <0.01    <0.01
MALICIOUS        —        —        1.10     1.17
IRC/CHAT         —        —        0.44     0.05
P2P              —        —        1.27     1.50
GAMES            —        —        0.17     0.18
MULTIMEDIA       —        —        0.22     0.21



4.1 Examining Under- and Over-estimation

Of the results in Table 4 we will concentrate on only a few example situations. The first and most dominant difference is for BULK, traffic created as a result of FTP. The reason is that port-based classification will not be able to correctly identify a large class of (FTP) traffic transported using the PASV mechanism. Content-based classification is able to identify the causal relationship between the FTP control flow and any resulting data transport. This means that traffic that was formerly either of unknown origin or incorrectly classified may be ascribed to FTP, which is a traffic source that will be consistently underestimated by port-based classification.

A comparison of values for MAIL, a category consisting of the SMTP, IMAP, MAPI and POP protocols, reveals that it is estimated with surprising accuracy in both cases. Both the number of packets and the number of bytes transferred are unchanged between the two classification techniques. We also did not find any non-MAIL traffic present on MAIL ports. We would assert that the reason MAIL is found exclusively on the commonly defined ports, while no MAIL transactions are found on other ports, is that MAIL must be exchanged with other sites and other hosts. MAIL relies on common, Internet-wide standards for port and protocol assignment. No single site could arbitrarily change the ports on which MAIL is exchanged without effectively cutting itself off from exchanges with other Internet sites. Therefore, MAIL is a traffic source that, for quantifying traffic exchanged with other sites at least, may be accurately estimated by port-based classification.

Despite the fact that such an effect was not pronounced in the analysed data set, port-based classification can also lead to over-estimation of the amount of traffic carried by a particular application. One reason is that mal-ware or attack traffic may use the well-known ports of a particular service, thus inflating the amount of traffic attributed to that application. In addition, if a particular application uses another application as a relay, then the traffic attributed to the latter will be inflated by the amount of traffic of the former. An example of such a case is peer-to-peer traffic using HTTP to avoid blocking by firewalls, an effect that was not present in our data. In fact, we notice that under the content-based approach we can attribute more traffic to WWW, since our data included web servers operating on non-standard ports that could not be detected under the port-based approach.

Clearly this work leads to an obvious question: how do we know that our content-based method is correct? We would emphasise that it was only through the labour-intensive examination of all data flows, along with numerous exchanges with system administrators and users of the examined site, that we were able to arrive at a system of sufficient accuracy. We do not consider that such a laborious process would need to be repeated for the analysis of similar traffic profiles. However, the identification of new types of applications will require a more limited examination of a future, unclassifiable anomaly.

4.2 Overheads of Content-Based Analysis

Alongside a presentation of the effectiveness of the content-based method we present the overheads this method incurs. For our study we were able to iterate through the traffic multiple times, studying data for many months after its collection. Clearly, such a labour-intensive approach would not be suitable if it were to be used as part of real-time operator feedback. We emphasise that while performing this work we built a considerable body of knowledge applicable to future studies. The data collected for one monitor can be reapplied for future collections made at that location. Additionally, while specific host information may quickly become out-of-date, the techniques for identifying applications through signatures and protocol-fitting continue to be applicable. In this way historical data becomes a priori knowledge that can assist in the decision-making process of each future analysis.

Table 5 indicates the relationship between the complexity of analysis and the quantity of data we could positively identify; items are ordered in the table by increasing level of complexity. The Method column refers to methods listed in Table 2 in Section 3. Currently our method employs packet-header analysis and host-profile construction for all levels of complexity. Signature matching is easier to implement and perform than protocol matching due to its application of static string matching. Analysis that is based upon a single packet (the first packet) is inherently less complex than analysis based upon (up to) the first KByte, which may require reassembly from the payload of multiple packets. Finally, any form of flow analysis is complicated, although the overheads of analysis are clearly reduced if the number of flows that require parsing is limited.

Table 5. Analysis method compared against percentage of UNKNOWN and correctly identified data.

                  UNKNOWN           Correctly Identified
Methods applied   Packets  Bytes    Packets  Bytes
I                 28.36    30.44    71.03    69.27
I, II, IX         27.35    30.33    72.05    69.38
I–III, IX         27.35    30.32    72.05    69.39
I–IV, IX          27.12    30.09    72.29    69.62
I–V, IX           25.72    28.43    74.23    71.48
I–VI, IX          19.11    21.07    80.84    78.84
I–VII, IX         1.07     1.22     98.94    98.78
I–IX              <0.01    <0.01    >99.99   >99.99

Table 5 clearly illustrates the accuracy achieved by applying successively more complicated characterisation techniques. The correctness of classification reported in Table 5 is computed by comparing the results using that method with the results using the full content-based methodology. Importantly, the quantity of UNKNOWN traffic is not simply the difference between total and identified traffic. Traffic quantified as UNKNOWN has no category and does not account for traffic that is mis-classified. It may be considered the residual following each classification attempt.

Table 5 shows that port-based classification is actually capable of correctly classifying 69% of the bytes. Contrasting this value with the known traffic in Table 4 further demonstrates that the mis-identified amount of traffic is rather limited. Nonetheless, 31% of the traffic is unknown. Applying host-specific knowledge is capable of limiting the unknown traffic by less than 1%, and signature and application-semantics analysis based on the first packet of the flow provides an additional benefit of less than 1%. It is only after we observe up to 1 KByte of the flow that we can increase the correctly-identified traffic from approximately 70% to almost 79%. Application of mechanism VII can further increase this percentage to 98%. In Table 2 we have listed example applications that are correctly identified when the particular mechanism is applied.

In summary, we notice that port-based classification can lead to the positive identification of a significant amount of the carried traffic. Nonetheless, it contains errors that can be detected only through the application of a content-based technique. Our analysis shows that, unfortunately, the greatest benefit of applying such a technique typically comes from the most complicated mechanisms. If a site contains a traffic mix biased toward the harder-to-detect applications, then these inaccuracies may have even more adverse consequences.


5 Summary and Future Work

Motivated by the need for more accurate identi?cation techniques for network applications, we presented a framework for tra?c characterisation in the presence

of packet payload. We laid out the principles for the correct classi?cation of network tra?c. Such principles are captured by several individual building blocks that, if applied iteratively, can provide su?cient con?dence in the identity of the causal application. Our technique is not automated due to the fact that a particular Internet ?ow could satisfy more than one classi?cation criterion or it could belong to an emerging application having behaviour that is not yet common knowledge. We collected a full payload packet traces from an Internet site and compared the results of our content-based scheme against the current state of the art — the port-based classi?cation technique. We showed that classifying tra?c based on the usage of well-known ports leads to a high amount of the overall tra?c being unknown and a small amount of tra?c being misclassi?ed. We quanti?ed these inaccuracies for the analysed packet trace. We then presented an analysis of the accuracy-gain as a function of the complexity introduced by the di?erent classi?cation sub-methods. Our results show that simple port-based classi?cation can correctly identify approximately 70% of the overall tra?c. Application of increasingly complex mechanisms can approach 100% accuracy with great bene?ts gained even through the analysis of up to 1 KByte of a tra?c ?ow. Our work should be viewed as being at an early stage and the avenues for future research are multiple. One of the fundamental questions that need investigation is how such a system could be implemented for real-time operation. We would argue that an adapted version of the architecture described in [5], which currently performs on-line ?ow analysis as part of its protocol-parsing and feature-compression, would be a suitable system. Such an architecture overcomes the (potential) over-load of a single monitor by employing a method work-load sharing among multiple nodes. 
This technique incorporates dynamic load distribution and assumes that a single flow will not overwhelm a single monitoring node; in our experience this limitation is sufficiently flexible as not to be a concern.

We also clearly need to apply our technique to other Internet locations, to identify how well it carries over to other mixes of user traffic and to settings where our monitoring is subject to other limitations. Examples of such limitations include having access to only unidirectional traffic or to a sample of the data; both situations are common for ISP core networks and for multi-homed sites. We already know that the first phase, the identification and culling of simplex flows, would not be possible if the only data available corresponded to a single link direction.

We emphasise that application identification from traffic data is not an easy task. Simple signature matching may not prove adequate in cases where multiple classification criteria appear to be satisfied simultaneously. Validating the candidate application for a traffic flow in an automated fashion is an open issue, and further research needs to be carried out in this direction. Moreover, we envision that, as new applications appear in the Internet, there will always be cases where manual intervention is required in order to understand their nature.
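The dependence of the simplex-culling phase on seeing both link directions can be made concrete with a small sketch. Packets are grouped under a direction-agnostic endpoint pair, and only flows observed in both directions survive; the packet-tuple representation here is an illustrative assumption, not the paper's data format.

```python
from collections import defaultdict

def cull_simplex_flows(packets):
    """packets: iterable of (src_ip, src_port, dst_ip, dst_port) tuples.
    Group packets under a direction-agnostic key (the sorted endpoint
    pair) and keep only flows for which traffic was seen in both
    directions. On a unidirectional capture every flow would appear
    simplex, which is why this phase needs both link directions."""
    directions = defaultdict(set)
    for src_ip, src_port, dst_ip, dst_port in packets:
        key = tuple(sorted([(src_ip, src_port), (dst_ip, dst_port)]))
        directions[key].add((src_ip, src_port))
    return {key for key, senders in directions.items() if len(senders) == 2}
```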

Lastly, in future work we intend to address the question of how much information needs to be accessible to a traffic classifier for the identification of different network applications. Our study has shown that in certain cases one may need access to the entire flow payload in order to arrive at the correct causal application. Nonetheless, if system limitations dictate an upper bound on the captured information, then knowing which application(s) will evade identification is essential. A technical report describing the (manual) process we used is provided in [8].

Thanks. We gratefully acknowledge the assistance of Geoff Gibbs, Tim Granger, and Ian Pratt during the course of this work. We also thank Michael Dales, Jon Crowcroft, Tim Griffin and Ralphe Neill for their feedback.

1. Moore, D., Keys, K., Koga, R., Lagache, E., kc Claffy: CoralReef software suite as a tool for system and network administrators. In: Proceedings of the LISA 2001 15th Systems Administration Conference. (2001)
2. Logg, C., Cottrell, L.: Characterization of the Traffic between SLAC and the Internet (2003) http://www.slac.stanford.edu/comp/net/slac-netflow/html/SLACnetflow.html
3. Fraleigh, C., Moon, S., Lyles, B., Cotton, C., Khan, M., Moll, D., Rockell, R., Seely, T., Diot, C.: Packet-level traffic measurements from the Sprint IP backbone. IEEE Network (2003) 6–16
4. Choi, T., Kim, C., Yoon, S., Park, J., Lee, B., Kim, H., Chung, H., Jeong, T.: Content-aware Internet Application Traffic Measurement and Analysis. In: IEEE/IFIP Network Operations & Management Symposium (NOMS) 2004. (2004)
5. Moore, A., Hall, J., Kreibich, C., Harris, E., Pratt, I.: Architecture of a Network Monitor. In: Passive & Active Measurement Workshop 2003 (PAM2003). (2003)
6. Roesch, M.: Snort - Lightweight Intrusion Detection for Networks. In: USENIX 13th Systems Administration Conference - LISA '99, Seattle, WA (1999)
7. Orebaugh, A., Morris, G., Warnicke, E., Ramirez, G.: Ethereal Packet Sniffing. Syngress Publishing, Rockland, MA (2004)
8. Moore, A.W.: Discrete content-based classification - a data set. Technical Report IRC-TR-04-027, Intel Research, Cambridge (2004)


