A Community Based Approach for Spam Filtering
Deepak P, Model Engg: College, Kochi, India deepak-p@eth.net
Abstract
Many people, on seeing new mail in their inboxes, have said, “Oh! That spam again”. People who observe the kinds of spam messages they receive would perhaps be able to classify similar spam mails into communities. Such properties of spam messages can be used to filter spam. This paper describes an approach to spam filtering that exploits the tendency of spam messages to fall into distinct communities. The working of a possible implementation of the approach is described in detail. The approach makes no prior assumptions about what spam is and can also be used to block non-spam nuisance mails. It also supports users who want selective blocking of spam mails based on their interests. The approach is inherently user-centric, flexible and user-friendly. The results of some tests carried out to check the feasibility of the approach are evaluated as well.
Jyothi John, Model Engg: College, Kochi, India jyothijohn@mec.ac.in
Sandeep Parameswaran, IBM Global Services India Pvt. Ltd., Bangalore, India psandeep@in.ibm.com
Spam filters have certain considerations and certain quality parameters. Spam precision is the percentage of messages classified as spam that truly are. Spam recall is the proportion of actual spam messages that are classified as spam. Non-spam messages are usually called solicited messages or legitimate messages. Legitimate precision, analogously, is the percentage of messages classified as legitimate that truly are. Legitimate recall is the proportion of actual legitimate messages that are classified as legitimate [2]. Spam precision is the parameter to be maximized. We do not want any legitimate messages to be classified as spam even if some errors occur the other way round. More plainly, the number of false positives should be reduced to a minimum.
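The four quality parameters defined above can be computed from the counts of correct and incorrect decisions. The following Python sketch is illustrative only: the function name `filter_metrics`, its argument names, and the example counts are assumptions introduced here, not part of the original study.

```python
def filter_metrics(tp, fp, fn, tn):
    """Return the four quality parameters as fractions in [0, 1].

    tp: spam messages correctly classified as spam
    fp: legitimate messages wrongly classified as spam (false positives)
    fn: spam messages wrongly classified as legitimate (false negatives)
    tn: legitimate messages correctly classified as legitimate
    Assumes each denominator is non-zero.
    """
    return {
        "spam_precision": tp / (tp + fp),
        "spam_recall": tp / (tp + fn),
        "legitimate_precision": tn / (tn + fn),
        "legitimate_recall": tn / (tn + fp),
    }


# Hypothetical counts, chosen only to show the calculation.
example = filter_metrics(tp=40, fp=2, fn=10, tn=48)
```

Note that reducing false positives raises both spam precision and legitimate recall, which is why those two parameters matter most here.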
3. Approaches to filter spam
The current techniques for filtering spam mail classify a message as either spam or non-spam (legitimate). Most of them perform statistical filtering using methods such as identifying keywords and phrases. Some of the approaches are reviewed below. The naïve Bayesian approach has been proposed as a method for spam filtering ([2], [3], [5]), and techniques have been proposed to make naïve Bayesian filtering viable in practice [4]. Memory-based approaches have been studied ([7], [8]) and implemented as well [6]. Neural networks have also been used for this purpose [9]. Keeping a blacklist of addresses to be blocked, or a whitelist of addresses to be allowed, is also very widely used. The use of extended mail addresses has been described as well [10].
1. Introduction
Spam mail can be described as ‘unsolicited e-mail’ or ‘unsolicited commercial bulk e-mail’. Spam is a growing problem, and survey reports show that in most cases more than 25% of e-mail received is spam [1]. The need to prevent it from clogging mailboxes is therefore assuming greater significance. The focus of this study is on filtering spam mail, that is, shielding users from it so that the time wasted on detecting and dealing with spam can be eliminated (or at least reduced). The losses due to bandwidth consumption and mail server processing load are not considered here. Section 2 enumerates the different quality parameters for spam filters. Section 3 describes the current approaches to spam filtering. Section 4 discusses spam mail communities and how the current approaches deal with them. Section 5 describes and evaluates a new approach to spam filtering which is based on spam communities. Section 6 narrates some experiments conducted. Section 7 lists some conclusions and possible future work and Section 8 lists the references.
4. Spam mail communities and current approaches
It is a common observation that spam mails can be classified into various communities, such as ‘online pharmacies’, ‘mortgage’ and ‘vacation offers’. Such communities are obvious and identifiable on visual inspection, but there may be many not-so-explicit communities that are machine-identifiable, such as ‘porn-mails bearing links to xyz.com’. None of the current approaches classifies mails to such an extent. Some classify mails only as ‘spam’ and ‘legitimate’, whereas some classify spam mails as ‘porn-spam’ and ‘other-spam’. Memory-based approaches lend themselves naturally to such classifications, where each element of the vector can be used to indicate a class of
2. Considerations for spam filters
© ICTTA’04 Organizing Committee, 2004
spam: the first element may indicate the probability of a message being a porn-spam, the second may indicate the probability of it being a ‘get-rich’ spam, and so on. But clearly, the number of classes that such techniques can impose is limited to the number of elements in the vector. The other methods, which are mostly based on statistical clustering, cannot easily be given such community identification capabilities. The communities need not be hardwired into the system; a spam filter may be given the capability to identify such spam communities automatically. If the system is built into the client end, the communities can even be highly user-specific. A system filtering mails for a person who receives only ‘online prescription’ related spam may build communities such as ‘weight-loss’, ‘anti-aging’, ‘sexual enhancement’ and ‘hair loss’. A person who wants to receive ‘anti-aging’ advertisements may mark that community as non-spam. Identification of such communities can thus be used to make the filter more flexible and more user-centric.
5.2.1 The phase of ignorance. Upon installation of the application, the system is ignorant of what spam is. The user has to mark the spam mails among the incoming ones and thus tell the system, ‘hey, this is spam’. The system records the entire message. This continues until about 50 messages have been accumulated by the system. Even during this time, it can automatically filter and accumulate mails using trivial heuristics such as ‘this is spam, as the user marked a mail from this address as spam earlier’.
5.2.2 The message similarity computation. One of the main algorithms used here is the computation of similarity between two messages. It may use heuristics such as ‘add one to the similarity score if both have at least two common names in their “To” address’. An alternative is to represent each message as a vector of the words occurring in it and take the dot product of the two vectors. The algorithm used in our test implementation is given below.
Table 1. Algorithm Similarity-Score
Algorithm Similarity-Score(Messages M1 and M2)
{
    Remove the repeated words in both messages to get messages N1 and N2;
    Count the words common to N1 and N2 and output the count as the similarity score;
}
5.2.3 The identification of communities. After accumulating close to 50 spam messages on the advice of the user, the system identifies communities of similar messages. It builds a graph with the messages as nodes, each undirected edge connecting two messages being labeled by the similarity score between them. The connected components that remain after weak edges are pruned are then identified. Our implementation used the following algorithm.
Table 2. Algorithm Community-Identification
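The Similarity-Score procedure of Table 1 can be sketched in Python as follows. The paper does not say how a message is split into words, so whitespace tokenization and lowercasing are assumptions made here, as is the function name `similarity_score`.

```python
def similarity_score(m1: str, m2: str) -> int:
    """Count the distinct words shared by two messages.

    Assumption: words are whitespace-separated and compared
    case-insensitively; the original tokenization is not specified.
    """
    n1 = set(m1.lower().split())  # N1: message 1 without repeated words
    n2 = set(m2.lower().split())  # N2: message 2 without repeated words
    return len(n1 & n2)           # size of the intersection


score = similarity_score("Buy cheap pills now now", "cheap pills shipped now")
```

Because repeated words are removed first, a message that repeats one keyword many times gains no extra similarity from the repetition.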
5. A community-based approach
5.1 Underlying concepts
The foundation of this approach is the assumption that spam mails can be classified into many communities. Communities of mails may be as precise as ‘mails sent from mail addresses starting with abc and containing the word aging at least two times in bold capitals’ (such descriptions would be implicit, as the communities are identified by the algorithm) or as general as just ‘porn-spam’. The former kind of definition may be appropriate in cases where the user receives spam from just two or three mailing lists. Another factor addressed by this approach is that of making the spam filter as user-centric as possible. The approach is most appropriately implemented on the mail client, and in whatever manner it is implemented, separate lists and tables have to be kept for each user. Another advantage of this approach is its flexibility. Nuisance mails (for example, constant requests for help from a distant friend) can also be identified, as a system implementing this approach does not come hard-coded with a set of rules such as ‘a mail having the word ‘sex’ would be spam 99% of the time’. Thus a person who would like to receive porn-spam but not other spam can also be accommodated. The system need not have any prejudices; it can learn from the user over time.
5.2 The approach and how it works
The general working model of an application using this approach (and thus the approach itself) is presented below. The different phases and the working of the algorithm are presented under separate sub-headings, with possible implementations listed as well. The algorithms used in our test implementation are described in detail at the appropriate places.
Algorithm Community-Identification()
{
    Build a graph with the 50-odd messages as nodes and undirected edges between them labeled by the similarity scores of the messages in question;
    Prune all edges which have a label value below a threshold T, resulting possibly in a disconnected graph;
    The connected components of the graph are enumerated as a set of communities N;
    For each pair of communities in N
    {
        If each similarity-score between a message in a community and a message in the other community bears a label not less than a threshold T1, merge the communities;
    }
    The merger in the previous step results in a set of communities N1;
    Output N1 as the set of communities of messages;
}
The initial threshold T may be set to a higher value than T1, because we do not want unrelated messages to be falsely grouped into a community in N. We thus expect N to consist of highly coherent communities. But our urge to avoid false communities may well have split logically coherent communities (which are coherent enough at the level of detail that we expect). The second step, the refinement of N to build the set N1, is a step towards merging such communities. We merge communities that are coherent enough that each message in one community bears at least some similarity (enforced by T1) to each message in the other community. This step may be avoided if T is set to a low value, but that risks grouping unrelated messages into the same community.
5.2.4 Community Cohesion Scores and Signatures. We have to compute a score for each community which indicates the cohesion within the community. We can also assign signatures to communities, which may consist of a set of words that occur very frequently in the community. The signature could also be a set of messages from the community which are as varied as possible. If a community consists of 3 sets of 10 identical messages each, the signature should contain at least one representative from each set.
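The Community-Identification procedure of Table 2 can be sketched in Python as below. This is a minimal sketch under stated assumptions: messages are plain strings, similarity is the distinct-word overlap of Table 1 with whitespace tokenization, connected components are found with a union-find structure, and the pairwise merge is repeated until no further merge is possible (the source does not say whether merging is iterated). All function names are introduced here for illustration.

```python
from itertools import combinations


def similarity_score(m1, m2):
    # Distinct-word overlap; whitespace tokenization is an assumption.
    return len(set(m1.lower().split()) & set(m2.lower().split()))


def connected_components(n, edges):
    """Group node indices 0..n-1 into components using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def merge_communities(communities, messages, t1):
    """Merge two communities when every cross pair scores at least t1."""
    communities = [list(c) for c in communities]
    merged = True
    while merged:  # repeat until a full pass makes no merge (assumption)
        merged = False
        for a, b in combinations(range(len(communities)), 2):
            if all(similarity_score(messages[i], messages[j]) >= t1
                   for i in communities[a] for j in communities[b]):
                communities[a].extend(communities[b])
                del communities[b]
                merged = True
                break
    return communities


def identify_communities(messages, t, t1):
    n = len(messages)
    # Keep only edges whose label is not below the threshold T.
    edges = [(i, j) for i, j in combinations(range(n), 2)
             if similarity_score(messages[i], messages[j]) >= t]
    return merge_communities(connected_components(n, edges), messages, t1)


msgs = ["a b c d", "a b c e", "x y z w", "x y z v", "a b q r"]
result = identify_communities(msgs, t=3, t1=2)
```

In this toy input the fifth message is isolated after pruning at T = 3 but is absorbed by the first community in the merge step, since it shares two words with each of that community's members.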
We used the algorithm given below to remove nearly identical messages.
Algorithm Refine(Set N1)
{
    while(1)
    {
        For each message pair, P and Q
        {
            Eliminate duplicate words in each message to form P1 and Q1, the sets of words in each message;
            If ((the cardinality of P1 intersection Q1) > (the cardinality of the symmetric difference between P1 and Q1))
            {
                Choose P or Q arbitrarily and eliminate it from the community;
            }
        }
        If no message could be eliminated in a complete pass, break out of the loop;
    }
    Return the newly formed set of messages N2, whose cardinality is less than or equal to that of N1;
}
This elimination of nearly identical messages saves space in the spam filter database and reduces the amount of computation to be done.
5.2.5 Spam Identification. Each incoming message is tested against the signatures of each spam community, and if it is found worthy of inclusion in a community, it is tested whether its inclusion would enhance the cohesion within that community. It can be added to the community and marked as spam if it increases the cohesion of the community. If not, it is marked as legitimate and passed to the user. Our test implementation used the following algorithm for the actual spam filtering process.
Table 3. Algorithm Refine
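A minimal Python sketch of the Refine procedure for a single community follows. It assumes messages are plain strings split on whitespace, and it resolves the "arbitrary" choice by always dropping the later message of a near-duplicate pair; both are assumptions made for illustration, as are the names `words` and `refine`.

```python
def words(message):
    # Set of distinct words; whitespace tokenization is an assumption.
    return set(message.lower().split())


def refine(community):
    """Drop near-duplicate messages from one community.

    Two messages are near-duplicates when they share more distinct words
    than the number of words appearing in only one of them.
    """
    kept = list(community)
    changed = True
    while changed:  # stop after a full pass with no elimination
        changed = False
        for i in range(len(kept)):
            for j in range(i + 1, len(kept)):
                p, q = words(kept[i]), words(kept[j])
                if len(p & q) > len(p ^ q):
                    del kept[j]  # arbitrary choice: drop the later one
                    changed = True
                    break
            if changed:
                break
    return kept


refined = refine([
    "buy cheap pills online today",
    "buy cheap pills online now",
    "win a free holiday",
])
```

The first two messages share four words and differ in two, so one of them is eliminated; the third shares nothing with the survivor and is kept.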
Table 4. Algorithm Test
Algorithm Test(Message K)
{
    For each community C in N2
    {
        worthy-of-inclusion score = the mean similarity-score between K and a message in C;
    }
    If (the maximum worthy-of-inclusion score obtained exceeds a threshold T2)
    {
        include K in the community with which the maximum worthy-of-inclusion score was obtained and flag K as spam;
    }
    else
    {
        Flag K as legitimate;
    }
    If (K was included in a community)
    {
        perform the refine algorithm on N2 (or more specifically, on the community in which K was included) and assign the new set of communities to N2;
        perform the merge algorithm on N2 and assign the new set of communities to N2;
    }
}
The merge algorithm used is the same as the merging procedure in the community identification algorithm. It is reproduced here for convenience.
Table 5. Algorithm Merge
Algorithm Merge(N2)
{
    For each pair of communities in N2
    {
        If each similarity-score between a message in a community and a message in the other community bears a label not less than a threshold T1, merge the communities;
    }
    The merger in the previous step results in a set of communities N3;
    output N3;
}
5.2.6 Maintenance. If the user indicates that a message delivered to him as legitimate was actually spam (a false negative), it can be added to the community to which it best fits, or kept as a single-member community. Periodically, if there is a proliferation of small communities, those can be gathered and processed just like the initial set of 50-odd spam messages to identify larger communities. If the user indicates that a message marked as spam was legitimate (the dreaded false positive), the system can inspect the communities to find messages of very high similarity to the one in question, and these can be deleted from the database of spam messages. Further, it can show the user the community in which the false positive was placed and ask whether that community is actually of interest to him.
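The classification step of the Test procedure (Table 4) can be sketched in Python as follows. The sketch covers only the scoring and flagging; the follow-up Refine and Merge passes are omitted for brevity. Communities are assumed to be lists of message strings, similarity is the distinct-word overlap with whitespace tokenization, and the names `classify` and `t2` are introduced here for illustration.

```python
from statistics import mean


def similarity_score(m1, m2):
    # Distinct-word overlap; whitespace tokenization is an assumption.
    return len(set(m1.lower().split()) & set(m2.lower().split()))


def classify(message, communities, t2):
    """Return (is_spam, community_index) for an incoming message.

    The worthy-of-inclusion score for a community is the mean similarity
    between the message and the community's members. If the best score
    exceeds t2, the message joins that community and is flagged as spam.
    """
    best_index, best_score = None, float("-inf")
    for index, community in enumerate(communities):
        if not community:
            continue  # skip empty communities (mean is undefined)
        score = mean(similarity_score(message, m) for m in community)
        if score > best_score:
            best_index, best_score = index, score
    if best_index is not None and best_score > t2:
        communities[best_index].append(message)
        return True, best_index
    return False, None


spam_db = [["cheap pills online", "cheap pills today"], ["free holiday offer"]]
verdict_spam = classify("cheap pills now", spam_db, t2=1)
verdict_ham = classify("meeting agenda attached", spam_db, t2=1)
```

Using the mean rather than the maximum similarity means that a message must resemble the community as a whole, not just one of its members, before it is flagged.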
As more and more messages are identified as spam, they are added to the database. Periodically we have to ‘clean’ the database. This can be done by algorithms similar to ‘Refine’. Periodically, the system can do a warm reboot by dissolving all communities and identifying them afresh from the entire set of messages, using the techniques used to process the initial set of 50-odd messages. A cold reboot would, obviously, be to empty the database. Our test implementation worked in an environment with no interaction from the user. It was supplied with a set of 50 known spam messages and then with a set of messages to be identified as either spam or legitimate. The proliferation of messages in spam communities was avoided by the periodic application of the merge and refine algorithms presented in the previous section.
5.2.7 Adaptation. The system has to adapt to the changing nature of spam. It can do so by identifying and deleting communities that have had no admissions for a long time. Perhaps the user has been taken off the list, or the nature of the spam sent by the spammer has changed. In either case, holding the community in the database would be of no use. Further, the user could be given options to manually clean up or delete communities. Although handling adaptation would not be too difficult, we did not handle it in our implementation, as the tests were performed on spam messages that arrived within a short period, during which significant changes in the nature of spam would not have occurred.
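The adaptation rule described above, deleting communities that have had no admissions for a long time, was not part of the test implementation. One possible form is sketched below in Python. The bookkeeping (a mapping from community name to the time of its last admission), the function name `prune_stale_communities`, and the 90-day idle limit are all hypothetical choices made for illustration.

```python
from datetime import datetime, timedelta


def prune_stale_communities(last_admission, now, max_idle):
    """Keep only communities that admitted a message within max_idle.

    last_admission: dict mapping a community name to the datetime of
                    its most recent admission (hypothetical bookkeeping).
    """
    return {name: stamp for name, stamp in last_admission.items()
            if now - stamp <= max_idle}


admissions = {
    "pharmacy": datetime(2004, 5, 20),
    "mortgage": datetime(2004, 1, 5),
}
active = prune_stale_communities(
    admissions, now=datetime(2004, 6, 1), max_idle=timedelta(days=90))
```

A suitable idle limit would depend on how often the user receives mail from each community, so in practice it might be left as a user-adjustable setting.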
5.3 Advantages
The system starts with an empty memory and learns what spam is from the user. The user is free to point to some nuisance mail (such as mail from an old lover who is no longer of interest) and mark it as spam. If the heuristics used for similarity computation give high weightage to the sender’s address (or perhaps even content), the user stands a good chance of not being troubled by the nuisance mail in the future. The initially empty memory of the system provides some more advantages. A person who enjoys some particular spam category, e.g., porn-spam, can continue to receive it by not marking such mails as spam during the ignorance phase. The system provides little help in the phase of ignorance, but more importantly it does not get in the way. Further, even after the ignorance phase, he can view the communities and mark one that he is interested in as non-spam. In cases where spam comes to a user from only a few spammers, each community might get precisely mapped to a single spammer. In such cases, small changes made by the spammer in his mails would not cause them to slip through as false negatives, thus providing increased precision over conventional statistical filters. Further, as the system is implemented per user, the implicit rules may be more user-specific, thus providing more flexibility to the user.
5.4 Disadvantages
The user is provided with little or no support during the ignorance phase. The mails themselves are stored in the database, which increases storage requirements. Bandwidth wastage is not prevented. Initially the user has to mark the spam manually, so the filter gives no indication of its presence, at least in the early stages. The system might take a long time to start filtering mails effectively.
Table 7 (continued). Tests on Set 2
Spam Recall                                    62.5%
Legitimate Recall                              70.0%
False Positives                                03
False Negatives                                15
6. Experiments and results
The main aim of the experiment was to test the feasibility of applying the concept of community clustering of spam mails to spam filtering. The implementation was tested in a non-interactive environment, with no user input possible during the process. The testing was done on 2 sets of 100 mails each, referred to hereafter as Set 1 and Set 2. In each set, 50 of the mails were known spam and were used as the ‘initial set’; the remaining messages, a collection of both spam and legitimate messages, are henceforth referred to as the ‘test set’. The values of T and T1 were set to 12 and 6 respectively (Section 5.2.3). The value of T2 was set to 13 (Section 5.2.5). Isolated nodes were considered as singleton communities in N. Singleton communities which could not be merged with any other community were discarded in N1. The rest of the algorithms are not parameterized and were used as described. Each message apart from the initial set of 50 messages was subjected to the algorithm Test and the results were logged. The values in the results tables below were obtained from the log file. The number of communities does not change in the course of the algorithm, as no user input is sought in real time. This test thus just demonstrates the feasibility of the approach.
Table 6. Test Results: Tests on Set 1
Number of communities in N1                    10
Total messages in N1 initially                 42
Total messages in N1 after Refine              37
Proportion of ‘initial set’ clustered          74%
Number of spam messages in ‘test set’          35
Number of legitimate messages in ‘test set’    15
Spam Precision                                 84.0%
Legitimate Precision                           44.0%
Spam Recall                                    60.0%
Legitimate Recall                              73.3%
False Positives                                04
False Negatives                                10
Table 7. Tests on Set 2
Number of communities in N1                    09
Total messages in N1 initially                 39
Total messages in N1 after Refine              35
Proportion of ‘initial set’ clustered          70%
Number of spam messages in ‘test set’          40
Number of legitimate messages in ‘test set’    10
Spam Precision                                 89.3%
Legitimate Precision                           31.8%
We consider the spam precision results very good, given that no hard-coded rules were used. The very low legitimate precision is in fact of little concern, as false negatives do not have disastrous consequences. The legitimate recall is a bit lower than expected, and the number of false positives is a cause for concern that calls for fine-tuning of the algorithm. The spam precision suggests that the approach is feasible in the real world. Further, in the real world, the database could be tuned using user input to provide better results. These experiments also considered only the text of the messages; image similarity measures and subject line similarity computations may well enhance the performance.
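As a worked check, the percentages reported for Set 2 follow directly from the counts in the results table (40 spam and 10 legitimate messages in the test set, 3 false positives, 15 false negatives). The short Python sketch below reproduces the arithmetic; the variable names are introduced here for illustration.

```python
# Counts reported for Set 2.
spam_total, legit_total = 40, 10
false_positives, false_negatives = 3, 15

# Derived counts of correct decisions.
true_positives = spam_total - false_negatives    # spam caught: 25
true_negatives = legit_total - false_positives   # legitimate passed: 7

set2 = {
    "spam_precision": 100 * true_positives / (true_positives + false_positives),
    "legitimate_precision": 100 * true_negatives / (true_negatives + false_negatives),
    "spam_recall": 100 * true_positives / spam_total,
    "legitimate_recall": 100 * true_negatives / legit_total,
}
rounded = {name: round(value, 1) for name, value in set2.items()}
```

Of the 28 messages flagged as spam, 25 were spam (89.3%), while only 7 of the 22 messages passed as legitimate truly were (31.8%), which is the low legitimate precision discussed above.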
7. Conclusions and future work
As indicated by the experiments, community-based detection of spam can prove to be a useful technique. It can be implemented as a mail client add-on, so that the complex matching algorithms run on the client machine (running such computationally intensive algorithms on the server might not be attractive). The experiments indicate that the approach explained in Section 5.2 would perhaps be feasible. Future work may be directed towards developing better algorithms for spam message similarity computation, for selecting victims to be purged in order to limit database size, for enabling the system to self-adapt to the changing nature of spam mails, and for approximate identification of communities from a corpus. This approach treats spam and legitimate mails asymmetrically, in that it clusters spam mails into communities but does not deal with legitimate mails in any sophisticated manner. Studies have to be performed as to whether legitimate mails can be dealt with in the same manner (by building communities). The feasibility of such an approach depends on the clusterability of legitimate mails which, even if it does exist, is not obvious.
8. References
[1]. Surf-Control’s Anti-Spam Prevalence Study 2002, URL: http://www.surfcontrol.com/resources/AntiSpam_Study_v2.pdf
[2]. A Bayesian approach to filtering junk e-mail, Sahami, Dumais, Heckerman & Horvitz, Learning for Text Categorization: Papers from the 1998 Workshop, Madison.
[3]. A plan for Spam, Paul Graham, August 2002, URL: http://www.paulgraham.com/spam.html
[4]. An evaluation of naïve Bayesian anti-spam filtering, Androutsopoulos et al., Proc. of the workshop on Machine Learning in the New Information Age, 2000
[5]. Better Bayesian Filtering, Paul Graham, January 2003, URL: http://www.paulgraham.com/better.html
[6]. TiMBL: Tilburg Memory-Based Learner version 4.0 Reference Guide, Daelemans et al. (2001)
[7]. Learning to filter spam e-mail: A comparison of a naïve Bayesian and a memory-based approach, Androutsopoulos et al., In Workshop on Machine Learning & Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-2000).
[8]. A learning content-based spam filter, Tim Hemel
[9]. Junk Detection using neural networks, Michael Vinther, 2002. URL: http://www.logicnet.dk/reports/JunkDetection/JunkDetection.pdf
[10]. Curbing junk mail via secure classification, Bleichenbacher et al., Financial Cryptography, 1998, pp. 198-213