
WO2007016869A2 - Systems and methods of enhanced e-commerce, virus detection and antiphishing - Google Patents

Systems and methods of enhanced e-commerce, virus detection and antiphishing

Info

Publication number
WO2007016869A2
WO2007016869A2 PCT/CN2006/001987 CN2006001987W
Authority
WO
WIPO (PCT)
Prior art keywords
company
agg
message
seal
plug
Prior art date
Application number
PCT/CN2006/001987
Other languages
French (fr)
Inventor
Marvin Shannon
Wesley Boudeville
Original Assignee
Metaswarm (Hongkong) Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Metaswarm (Hongkong) Ltd. filed Critical Metaswarm (Hongkong) Ltd.
Publication of WO2007016869A2 publication Critical patent/WO2007016869A2/en


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/145 Countermeasures against malicious traffic the attack involving the propagation of malware through the network, e.g. viruses, trojans or worms
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 User authentication
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2119 Authenticating web pages, e.g. with suspicious links

Definitions

  • This invention relates generally to information delivery and management in a computer network. More particularly, the invention relates to techniques for protecting users against computer viruses, phishing and pharming.
  • BACKGROUND OF THE INVENTION
  • e-Commerce can be threatened by a phisher who sets up a website, often called a pharm, that might pretend to sell some good or service.
  • the pharm induces the browsing visitor to make a purchase, often by entering financial details like a credit card number.
  • the pharm might have images of seals from various electronic assurance companies, to persuade the visitor that the site is real. But these seals might be bogus, since images of real seals can be trivially copied on the Web.
  • viruses and worms are also a serious and continuing problem in computer usage. In what follows, "virus" should be read as including "worm".
  • viruses and worms propagate by various means, often via email or by a user downloading a file from a website. In both cases, there might typically be comments in the email, or on the web page from which the file was downloaded, that claim the attachment or file is benign or that it will do some given task, whereas, once on the user's computer, it actually performs a subversive task.
  • a virus is typically in binary form, though it might also exist in a higher-level format, like ASCII, where perhaps it attempts to be compiled on the user's computer by some means. Also, viruses exist in every scripting language, like Perl or PHP, where these scripts are written in ASCII. Plus, common proprietary formats like Postscript or PDF (both from Adobe Corp.) are effectively computer languages, and viruses exist in these languages too. There have been many attempts to detect viruses. A comprehensive recent description is in "The Art of Computer Virus Research and Defense" by Szor (Addison-Wesley 2005). Most of the current methods involve researchers at an anti-virus firm obtaining suspected files, and then doing intensive manual analysis. The results of this analysis are then propagated to customers of the firm, usually in the form of virus "signatures".
  • a signature might be a checksum or hash of a known virus. But polymorphic viruses have been written in response. These can change their instantiations, resulting in different checksums or hashes, which renders a simple hash signature useless.
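The fragility of a plain hash signature against even a one-byte polymorphic change can be sketched as follows. This is a minimal illustration, not the patent's method; the byte strings and the use of SHA-256 are assumptions for the demo.

```python
import hashlib

def signature(data: bytes) -> str:
    """A simple virus 'signature': the hash of the entire file."""
    return hashlib.sha256(data).hexdigest()

# Stand-in bytes for a known virus body (illustrative only).
original = b"MZ\x90\x00 pretend this is a virus body with a payload"

# A polymorphic variant: the mutation engine flips a single byte.
variant = bytearray(original)
variant[10] ^= 0x01

sig_db = {signature(original)}                  # signatures known to the anti-virus firm
assert signature(original) in sig_db            # the original strain is caught
assert signature(bytes(variant)) not in sig_db  # the one-byte variant slips through
```

Because the hash changes pseudo-randomly under any input change, the firm must instead hunt for some invariant internal structure, as the next paragraph describes.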
  • the anti-virus firm then has to investigate each polymorphic virus, for some invariant internal structure that can be used as an effective signature. Quite apart from the often difficult manual nature of this investigation, it often has to be repeated for each new type of polymorphic virus that is detected. The problem is ongoing: such signatures need to be continually found.
  • Suppose Amy is a phisher.
  • the new phishing version involves the phisher targeting a specific company.
  • Amy then sends messages to those users. Typically, these messages might claim to be from the company's personnel or IT department. Often, there is a link to an external website run by the phisher. If the unwary user clicks on it, then the user is shown a web page where he is meant to enter various personal data. Sometimes this might just be his company password, because the phisher can then use it, together with the user's username, to log in and mount other attacks. Or, the user might be asked to furnish other types of personal data that might let Amy impersonate the user in contexts external to the company.
  • This type of attack is sometimes called "spear phishing", because only a few users are targeted at a time. The low multiplicity of the messages can make it hard for some antiphishing methods to find them.
  • Spear phishing also tries to avoid another antiphishing method. This involves user education, where a user is warned not to divulge personal data in response to a message claiming to be from that user's bank or credit card company or similar institutions. But when a user at work gets a message claiming to be from someone else in the same company, often her guard might be lowered.
  • Suppose the virus, instead of having a link to an external domain run by Amy, has a link back to the computer on which the virus is running.
  • The virus is also assumed to be able to run a web server. Then, a recipient of a message from the virus has a much harder time discerning its bona fides. Here, the virus is able to take any users' replies and then relay them to a destination outside the company.
  • Agg Aggregation Center
  • the plug-in can validate the user and the website to each other, making a transaction more likely. And the transaction details can be handled at the Agg, which can act akin to a clearinghouse.
  • the website essentially outsources the transaction processing to the Agg. This can also enable a greater usage of micropayments.
  • Our Invention has three main sections - using the antiphishing plug-in for e-commerce, using tags against viruses, and enhanced anti-spear phishing and anti-pharming.
  • a common, unifying theme is the central role played by an Aggregation Center ("Agg").
  • the plug-in can also present information to Nu about Jane.
  • Nu might expose an Application Programming Interface (API) or Web Service, to be used by the plug-in, if the latter exists on the browser.
  • API Application Programming Interface
  • Nu might perform some validation procedure with the plug-in and possibly the Agg, such that Nu has confidence that the plug-in is itself valid. This procedure might involve some type of zero knowledge protocol.
  • Nu might ask the plug-in for some information about the plug-in, which Nu then sends to Agg for confirmation.
  • Nu could assume that Agg, which is located on the network, is a valid authority. Depending on Agg's reply, other steps might be forthcoming.
  • the information about Jane that is sent to Nu can be controlled by her in some simple fashion. For example, at an earlier point in time, she might bring up the browser, login to the plug-in and then set this information. The actual information sent to Nu is some combination of this and other information from the plug-in.
  • the idea is that the plug-in and Agg offer information to Nu that preserves Jane's privacy, while still furnishing reliable data on her. For example, the plug-in might send some type of credit assessment of Jane. Or some spending limit that it and the Agg can validate, and which Jane is willing to tell another website, in this fashion.
  • the Agg electronically stores monetary accounts of several (or many) of its users; specifically including Jane, in our examples. Users can access these accounts by various means. And the Agg might also have interactions with established financial networks, like that used for Electronic Funds Transfer.
  • the information about Jane that is passed to Nu should be written in some standard structured form. Possibly in XML. If so, then perhaps in some extension to one or more of the Electronic Commerce Markup Language (ECML), the Web Service Description Language (WSDL) or the Business Process Execution Language (BPEL).
  • ECML Electronic Commerce Markup Language
  • WSDL Web Service Description Language
  • BPEL Business Process Execution Language
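The structured form described above might be sketched as follows. The element and attribute names (`BuyerAssessment`, `SpendingLimit`, `ValidatedBy`) are purely hypothetical: ECML, WSDL and BPEL define no such elements, and this only illustrates passing validated, privacy-preserving data about Jane in XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured assessment of Jane, of the kind the plug-in
# and Agg might furnish to the website Nu (names are illustrative).
root = ET.Element("BuyerAssessment", {"version": "1.0"})
ET.SubElement(root, "SpendingLimit", {"currency": "USD"}).text = "500"
ET.SubElement(root, "ValidatedBy").text = "agg.example.com"

payload = ET.tostring(root, encoding="unicode")
assert "SpendingLimit" in payload
assert "agg.example.com" in payload
```

The point of the structure is that Nu can parse it programmatically and, unlike a wallet form-fill, each field is something the plug-in and Agg can vouch for.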
  • the passing of such structured information from Jane's computer to Nu should not be confused with the functionality of electronic wallets.
  • These are programs that reside on a user's computer, and are often little more than convenient form fillers. That is, the user might enter various information about herself, and these are later passed to websites, saving her the manual retyping. But typically, there is little validation of such information. For a website getting information from a wallet, there is usually no increase in the reliability of such information.
  • the Agg can also ask each of its corporate customers for a list of its valid push accounts. Typically, these are made public, since one attack is for a message or pharm that pretends to be from or of Nu to then give a false push account address. To the extent that such addresses can be extracted programmatically from a message or web page, the plug-in can extend its efficacy. This also has merit in helping Nu guard against a subversive attack, where one of its employees is a phisher, who then changes a web page and gives her own account address, or does likewise in outgoing messages. It assumes that the actual valid push addresses have not been compromised when they were sent to the Agg.
  • Another problem is having the website maintain elaborate financial software to handle such micropayments, as well as expecting users to do likewise on their machines.
  • the Agg addresses the latter problem, by centralizing this software on itself.
  • the plug-in addresses the former problem. It can present numerous options to Jane, that let it pay such transactions, without any manual involvement by her, except perhaps in an exception mode. For example, Jane can set several of these values:
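The values themselves are not enumerated in this excerpt, so the following is a hypothetical sketch of per-transaction settings a user like Jane might configure, with an exception mode for larger amounts. All field names and thresholds are invented for illustration.

```python
# Hypothetical micropayment policy a user might set in the plug-in.
payment_policy = {
    "max_per_transaction": 0.50,  # ceiling for any single micropayment, in dollars
    "max_per_day": 5.00,          # daily spending cap handled via the Agg
    "ask_me_above": 0.25,         # exception mode: prompt Jane above this amount
}

def authorize(amount: float, spent_today: float, policy: dict) -> str:
    """Return 'auto', 'ask', or 'deny' for a proposed micropayment."""
    if amount > policy["max_per_transaction"] or spent_today + amount > policy["max_per_day"]:
        return "deny"
    return "ask" if amount > policy["ask_me_above"] else "auto"

assert authorize(0.10, 0.00, payment_policy) == "auto"   # paid without manual involvement
assert authorize(0.40, 0.00, payment_policy) == "ask"    # exception mode
assert authorize(0.90, 0.00, payment_policy) == "deny"   # over the per-transaction cap
```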
  • the validator might also let its customer display the seal in electronic messages sent by the customer, where these messages might be written in some markup language that has hyperlinks (like HTML). So that if the message's recipient views it in a program that can show the markup language, like a browser, then she can also get reassurance from that seal, and click on it for more information.
  • some shameless phishers have written seals into their messages and pharms, without any permission from the validators.
  • the Agg can have validators as customers, in the specific role of validator, where they give the Agg information which will be described below.
  • the Agg can periodically download such validator lists and their associated data to a plug-in, in the same way as it does for regular customers and their Partner Lists and other data ["2245", "2458"].
  • This new type of customer extends the capabilities of the Agg and plug-in. It is also then a possible extra revenue source for the Agg, inasmuch as the Agg and its plug-ins help protect the value of a validator's seal.
  • the invalid() function refers to various steps, along the lines of those described in ["2245", "2458"], to be done when a web page or message is invalidated by the plug-in.
  • the order of the above tests has no particular meaning. In a given implementation, one might choose a specific order to optimize performance. In part because some of the above tests are unlikely to be often true, but are listed for conceptual completeness.
  • the Agg obtains from each of its validators information that includes, but is not limited to, the following: 1. A list of customers that have a right to display the seal. 2. Can a seal be shown in a customer's website?
  • the list of information from a validator assumes that for all of its customers, the items after the first item apply equally. That is, if a validator says that a seal cannot be shown in a customer's message, then it applies for all the validator's customers.
  • the pseudocode also assumes this. But clearly, all the items after the first item can be generalized to be functions of a customer. (Vectors instead of scalars.) In this case, there would be the obvious generalizations of the tests in the pseudocode. For example, consider the test that currently states "if ( image not clickable AND B's seal must be clickable )".
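The seal tests referred to above can be sketched in runnable form. The policy field names and domains are illustrative; as the text notes, the order of tests carries no particular meaning and could be rearranged for performance.

```python
def check_seal(image_src: str, image_link, validator: dict) -> str:
    """Sketch of the plug-in's checks of a seal against validator B's policy.
    image_src: domain the seal image was loaded from.
    image_link: domain the seal links to, or None if not clickable.
    validator: B's policy -- allowed source domains (the set V), whether
    the seal must be clickable, and allowed link targets."""
    if image_src not in validator["allowed_sources"]:
        return "invalid"  # seal not loaded from B
    if image_link is None and validator["must_be_clickable"]:
        return "invalid"  # image not clickable AND B's seal must be clickable
    if image_link is not None and image_link not in validator["allowed_targets"]:
        return "invalid"  # clickable, but not to any domain of B
    return "valid"

b_policy = {
    "allowed_sources": {"seals.validator-b.example"},
    "must_be_clickable": True,
    "allowed_targets": {"www.validator-b.example"},
}
assert check_seal("seals.validator-b.example", "www.validator-b.example", b_policy) == "valid"
assert check_seal("pharm.example", "www.validator-b.example", b_policy) == "invalid"
assert check_seal("seals.validator-b.example", None, b_policy) == "invalid"
```

Generalizing the scalar policy fields to per-customer vectors, as the text suggests, would simply make `b_policy` a function of the customer.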
  • a validator B might have an antiphishing measure that consists of requiring that its seal be loaded from its website.
  • B's web server records the network address of whoever is downloading a seal.
  • B can then send out spiders, or do manual perusal, of these addresses, to try to detect unauthorized usage. Note that this fails if a phisher sends out messages with a seal, where the seal is loaded from B whenever a reader views such a message. In this case, all B knows is that a message at a given message provider has a seal. In general, B does not have access to the text of that message or the links (if any) in it. Whereas our method applies equally well to both messages and websites.
  • It is possible for the validator Q to keep a record of the addresses of nodes that go to Q, and of the URLs they present, where these URLs point to locations in Q. But in this case, those nodes are browsers, and unlikely to yield information about any phishing.
  • the presented URLs do potentially have more information, under some limited circumstances.
  • a URL might be valid, as determined by Q. But Q may be unable to tell if the query came from a user at the customer's website, at a pharm imitating the customer, or in a message purporting to be from the customer. But the biggest drawback for Q is simply that the seal is not loaded from its website. Many users do not usually click on a seal; they simply treat its visual presence as sufficient validation.
  • a variation on the above is where the phisher uses scripting to generate dynamic addresses for loading images, or for (outgoing) links. She is attempting to conceal where she is loading the seal from, or where it points to. We can use our antispam methods of "1698" to detect her attempts.
  • the phisher might try to load a seal from a location not in V, and have it clickable, but not to any domain in V. If the browser is viewing a message with the seal, the latter might be an image which is loaded from an attachment in the message.
  • the plug-in might run image recognition algorithms against such loaded messages, to possibly identify close matches against images of its validators' seals. Several heuristics are possible. For example, most seals are fairly small, because they are meant to only take up a fraction of a page or message. The plug-in can have a maximum size found over all its validators' seals. So it can ignore images in a message or page outside those dimensions. An important computational saving. Plus, the images it does look up are then small, and would be quicker to analyze. These remarks also apply if the seal is not clickable.
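The size heuristic above can be sketched directly: before any expensive image comparison, discard images larger than the biggest known validator seal. The dimensions below are invented for the example.

```python
def candidate_seal_images(images, max_w: int, max_h: int):
    """Keep only images small enough to possibly be a validator's seal.
    images: list of (name, width, height) found in a message or page.
    max_w, max_h: maximum dimensions over all known validators' seals."""
    return [name for name, w, h in images if w <= max_w and h <= max_h]

page_images = [
    ("banner.jpg", 728, 90),   # too wide to be a seal
    ("seal.gif", 100, 50),     # plausible seal candidate
    ("photo.png", 1024, 768),  # far too large
]
assert candidate_seal_images(page_images, 150, 150) == ["seal.gif"]
```

Only the survivors would then be passed to the (comparatively costly) image recognition against the validators' seal images.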
  • a validator might require its customer to insert a special piece of HTML, and possibly scripting code, into the customer's page, so that, if there is a script, then when the page is loaded into a browser, the script might execute (if the user enables scripting), get some information about the browser and the computer, and then show the seal's image. This might involve the script going back to V's website and downloading the image.
  • Amy might also use an image editor to deliberately produce a visually similar, but different, copy of a seal image. Where she starts with an exact copy of the seal image. This has to do with trying to evade an image comparison method. Exact matches of two sets of data are easy to test for. So she can use the editor to introduce variations into her image, that cannot be detected visually. For example, randomly toggling the low intensity bits that code for color. Plus, she might also change the x or y dimensions of the image, as well as other changes inside the image. Our method refers to external image comparison methods. But it can be seen from this that quantifying the difference between two images can be subjective. Hence, while our other steps are deterministic, this comparison is qualitatively different.
  • An ISP can also use the plug-in methods described above, to search for fake seals in its messages, both incoming and outgoing. Then, if it finds any, it could perhaps reject those messages, or mark them as suspect in some fashion. So that any such incoming messages might go into the recipient's bulk folder, for example. While the detection of fake seals in outgoing messages might cause it to scrutinize the senders. Perhaps to see if they are phishers, or if those accounts might have been hijacked by phishers. (Maybe the rightful owners of the accounts had their passwords discovered by phishers, by some means?)
  • An ISP can do more steps to detect fake seals, that a plug-in cannot perform.
  • the toughest scenario to detect is where Amy, the phisher, uses an image of a seal that is not loaded from a validator, and which does not link to a validator. But where that image can be construed by a casual observer as a real seal.
  • the ISP picks a random set of messages. Or, it might already be making Bulk Message Envelopes (BMEs), as part of the antispam measures of "0046". If it has BMEs, then it can pick randomly, out of the top 20%, say, of the BMEs, where we rank the BMEs by the number of messages in each BME.
  • BMEs Bulk Message Envelopes
  • If Amy is sending forth many messages, then statistically we are likely to have these amongst the chosen messages. And the more messages she sends, the more likely this is true.
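The sampling step described above can be sketched as: rank the BMEs by message count, keep the top fraction (20% in the text's example), and pick randomly from those. The counts below are invented for the example.

```python
import random

def sample_top_bmes(bmes: dict, top_fraction: float = 0.2, k: int = 2):
    """bmes: maps a BME id to the number of messages in that BME.
    Rank BMEs by count, keep the top fraction, sample k of those."""
    ranked = sorted(bmes, key=bmes.get, reverse=True)
    top = ranked[:max(1, int(len(ranked) * top_fraction))]
    return random.sample(top, min(k, len(top)))

# Ten BMEs with 100..1000 messages each; bme10 is the largest.
counts = {f"bme{i}": i * 100 for i in range(1, 11)}
chosen = sample_top_bmes(counts, top_fraction=0.2, k=2)
assert set(chosen) == {"bme10", "bme9"}  # the top 20% of ten BMEs
```

A prolific phisher's bulk mailing lands in a large BME, so sampling from the top-ranked BMEs concentrates the search where her messages are likeliest to be.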
  • a user on her computer would have a plug-in to her browser. This would detect emails or websites that she is viewing in the browser, and which have the notphish tag. Then, the plug-in could extract any links in the email or web page, and compare these to those in the PL for that company, where the plug-in gets the PL from the Agg. That usage could be applied to any type of electronic message.
  • Our earlier filings described methods to be used against viruses, especially "3878", where we discussed a method of authenticating electronic messages and arbitrary files, where the latter (and of course the former) are expected to be disseminated in any fashion across the Internet. Plus, the method of this Invention can be explicitly applied to data written in any binary format, as well as higher formats like ASCII.
  • By a binary format we mean a format of ones and zeros that can be directly executed by a CPU, or by a simulated CPU, as in the case of Java bytecode, for example.
  • the data to be analyzed might be an attachment in an email. Or it might be a file that is already on the user's computer, where the file has arrived on the computer by some means.
  • the data is from a company that implements our method.
  • At or near the start of the data is a code written in some standard format (like ASCII perhaps). Without loss of generality, assume that ASCII is the chosen format.
  • We use tags like those in HTML, SGML or XML. This choice is arbitrary, and our method is not confined to this specific choice. But given the widespread use of HTML, we suggest that the choice offers a useful familiarity to many.
  • the tag has an attribute, shown above as "a", that designates the author or owner of the data. Call this the "author attribute". Its value is some quantity that identifies the author. In the example, we imagine that the author has a website, called author.com.
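Extracting the author attribute might be sketched as below. The tag name `notvirus` and the 256-byte search window are assumptions for illustration; the text only specifies an ASCII tag with an author attribute at or near the start of the data.

```python
import re

def author_of(data: bytes):
    """Look near the start of the data for an HTML-style tag with an
    author attribute, e.g. <notvirus a="author.com"/>.  Returns the
    attribute value, or None if no such tag is found."""
    head = data[:256].decode("ascii", errors="replace")
    m = re.search(r'<notvirus\s+a="([^"]+)"\s*/?>', head)
    return m.group(1) if m else None

assert author_of(b'<notvirus a="author.com"/>\x00\x01binary payload') == "author.com"
assert author_of(b"\x00\x01no tag here") is None
```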
  • We describe a program, Kappa, that analyzes data (files) and classifies the data into one of these categories:
  • Suppose Kappa finds a tag. Then it looks for the author attribute and value. If this is missing, then it might classify the data as "invalid". This is considered to be worse than "unvalidated", because it suggests an actual attempt to mislead the program.
  • the blacklist is optional. Entries in it might be put there for various reasons. Some might be spammer domains, for example. Alice might determine what reasons she will use for the blacklist. Here, the blacklist need not be a simple list of domains. Instead, associated with each domain might be a datum that indicates some classification of that domain. So the reasons filter the blacklist. Also, this determination of what reasons to use might be done at a higher level than Alice; by the sysadmin, for instance. Suppose the domain is not on the blacklist. Then Kappa finds a hash ("h") of the data. The choice of hashing function will be some widely used one. Currently an example of this is SHA-1, though others may arise in the future.
  • the hash is made of all the data. This is crucial, because a hash is a very sensitive function of its input bits. In principle, and usually in practice, inverting just one bit in the input will pseudo-randomly flip about half the output bits.
  • A checksum must not be used in place of a hash.
  • Virus writers have discovered that with a commonly used checksum function, the following can be attempted. Take a given file, into which the writer wants to insert a virus. Find the checksum of the file. Then insert the virus and systematically make certain other changes to the file, such that the checksum of the file is the same as the original checksum. While this is not possible for every choice of file and virus, the fact that it works for some choices renders the usage of a checksum very problematic.
  • Kappa now sends (author.com, h) to the Agg. It is asking the Agg, is author.com in your list of companies, and if so, is h in author.com's list of valid hashes? If author.com is not in Agg's company list, then it replies "no" to Kappa, who then marks the data as "invalid”. While if author.com is an Agg customer, then it at some earlier time has presumably sent h to the Agg. Along with hashes of any other data that it might be promulgating. If the h sent from Kappa is in this list, then the Agg replies "yes” and Kappa marks the data as "valid”. Otherwise the Agg replies "no" and Kappa marks the data as "invalid".
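Kappa's decision procedure, as described above, can be sketched with the Agg's reply simulated by a local dict mapping company domain to its set of published hashes (in practice this would be a network query). The use of SHA-256 here is an assumption; the text names SHA-1 as the then-current example.

```python
import hashlib

def kappa_classify(data: bytes, author, agg_hashes: dict) -> str:
    """Sketch of Kappa's classification of data into one of 3 states."""
    if author is None:
        return "unvalidated"   # no tag: fall back to existing anti-virus methods
    if author not in agg_hashes:
        return "invalid"       # author.com is not in the Agg's company list
    h = hashlib.sha256(data).hexdigest()  # hash of ALL the data
    return "valid" if h in agg_hashes[author] else "invalid"

payload = b"release build of some program"
# author.com sent the hash of its valid data to the Agg at some earlier time.
agg = {"author.com": {hashlib.sha256(payload).hexdigest()}}

assert kappa_classify(payload, "author.com", agg) == "valid"
assert kappa_classify(payload + b"!", "author.com", agg) == "invalid"   # tampered data
assert kappa_classify(payload, None, agg) == "unvalidated"              # no tag at all
assert kappa_classify(payload, "fake.com", agg) == "invalid"            # not an Agg customer
```

Note the classification is deterministic, as the text later emphasizes, unlike heuristic anti-virus scoring.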
  • a status of "invalid” is worse than "unvalidated", because it appears that someone is trying to fake a tag. More suspicious than merely lacking a tag.
  • If Kappa finds that the data is invalid, due to its computed hash not being in author.com's official hashes, then it might alert that company or the Agg, indicating that someone is trying to pass off data as coming from author.com.
  • Kappa might offer to upload the invalid data to the company or the Agg, so that they could analyze it. This might be overridden by Alice or by her sysadmin.
  • Kappa should not simply trust an arbitrary author domain, because a virus writer might set up a domain, possibly appearing similar to a "respectable" domain, and then direct Kappa there, where Kappa will then be told that the data's hash is valid. However, it is possible for Kappa to have a whitelist of domains, so that if author.com is in this list, then instead of Kappa presenting (author.com, h) to the Agg, it presents (h) to the Web Service at author.com.
  • Our method is deterministic, when it classifies data as valid or invalid. (When the data lacks a tag, then existing anti-virus methods have to be used on it.) Some anti-virus methods use heuristics (rules of thumb) to try to identify a virus. These are probabilistic classifications.
  • An extension of our method is where a company can associate ancillary data with its published hashes. This data might include short descriptions about the file. Like a title and version number, for example.
  • the concept of a tag at the start of certain types of data is known.
  • the first four bytes of every Java class file have the hexadecimal magic number 0xCAFEBABE. It is used by various Java tools as a quick indicator of whether a file is a Java class file or not.
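The magic-number check is simple enough to show directly; this is the conventional test, not part of the invention's tag method, which is more expressive.

```python
def is_java_class_file(data: bytes) -> bool:
    """Check the four-byte magic number 0xCAFEBABE at the start
    of a Java class file."""
    return data[:4] == b"\xca\xfe\xba\xbe"

assert is_java_class_file(b"\xca\xfe\xba\xbe\x00\x00\x00\x34 rest of class file")
assert not is_java_class_file(b"%!PS-Adobe-1.0\n")   # a Postscript header instead
```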
  • the start of a Postscript file should have "%!PS-Adobe-1.0". (The 1.0 version could be different in some Postscript files.) In our method, we have a more expressive usage.
  • When a Web Service accepts data, it can use our method to validate the data, especially if the Service were perhaps to later install and run the data. That is, the data is treated as a program.
  • the main approach taken by the Web Service Description Language and the Business Process Execution Language is to permit the use of digital signing (PKI) of various portions of the XML data that is transmitted between two (or more) Web Services.
  • PKI digital signing
  • Our method is especially appropriate in the context of Web Services. As discussed earlier, it is meant for programmatic interaction, and the client companies of the Agg run a Service that answers queries about hashes of its published data. And, as above, it is more lightweight (faster) than PKI.
  • the program Kappa that we have described above could be implemented in several ways. For example, it might be incorporated into Alice's runtime environment, such that she can only run (or in general access) the data if Kappa validates it. Depending on her user privileges, she might be able to override this constraint or not.
  • Kappa has a user interface, it might indicate in some simple way the state of data that it has attempted to validate. For example, it might have a widget that turns green when the data validates, and red when it invalidates. This follows our suggestions in the Antiphishing Provisional for an antiphishing browser plug-in.
  • Kappa could also be implemented as a browser plug-in.
  • Rho a company
  • Theta a special program needed to use Rho's data format
  • Theta might be made by the company (“Phi”) that owns the format.
  • Phi is different from Rho.
  • One approach to prevent unauthorized changes is for the format to have some type of intricate encoding or encryption. This is used by Theta such that if it detects invalid data (caused presumably by unauthorized changes), then it will not display the data, or do whatever else it might normally do with valid data.
  • Rho can be a customer of the Agg, and furnish hashes of its valid data. Phi can write a version of Theta such that whenever Theta encounters a file, it queries the Agg.
  • the tag author notation can designate any website. Or, more generally, it could use another type of identifier. (A phone number, for example.)
  • Kappa classifying data into one of 3 states
  • Kappa or the Agg
  • Kappa could contact a Web Service at that site, presenting the hash of the data.
  • Kappa could classify the data as, say, "conditional”. This would be a state less reliable than "valid” but more reliable than "invalid”.
  • Kappa (or the Agg) might consult various resources that might have data about someone.com. For example, if an antispam website labels someone.com as a spammer, then this could translate into some value between "valid” and "unvalidated”. Whereas if someone.com is considered to be a phisher, then another, worse value might be chosen.
  • our method does not preclude the use of existing antivirus methods. Rather, our method is a way to easily detect valid data from major, reputable companies. It could be used as a pre-filter to the current methods, offering a quick way to reduce the load on those methods. Plus, if users come to expect that most of the new data they encounter should be validated by our method, then it reduces the chance that they will use data that is not validated. That in itself adds pressure on the virus writers.
  • 2.3 Registrar
  • We now discuss other extensions of "3878". In it, we described the general case of three parties: a sender ("Jane") and her ISP; an Agg, which we termed a Registrar, that holds hashes of Jane's outgoing messages, and possibly of some files that she disseminates; and a recipient ("Dinesh") and his ISP.
  • ISP Internet Service Provider
  • the Registrar is Jane's message provider. It has the simplifying consequence that Jane does not need a special plug-in to be integrated with her browser. Instead, she can use her browser to go to her provider's website, and there she types and sends a message. Her provider has logic that can then hash her message, or designated subsets thereof. Effectively, the plug-in gets collapsed into the Registrar. Hence, her provider could offer a Web Service that answered queries from external (or internal) recipients like Dinesh.
  • Dinesh may still need a plug-in at his browser, in order to compute a hash from Jane's message, and to then submit this and her electronic address, to Jane's provider. Or, perhaps, Dinesh gets his messages at an ISP that can perform these tasks.
  • An extension is where the hash is written into the message in some fashion.
  • the hash might be written into the header. Or, it might be written as a tag into the body.
  • This id might even be independent of the message, and perhaps be chosen randomly. In this case, the id would have to be transmitted with the message, in some format that was made public. Dinesh's plug-in (or his ISP's mail client) can use this known format to extract the id. It can then ask her provider (Registrar) if this pair (Jane's address, id) matches an entry in its records. The usage of an id that is independent of the message can lead to spoofing.
  • An id or hash is written in the outgoing message, along with an expiration time for that id or hash. This might be written in some standard format, like Universal Time. It tells Dinesh's mail reading program that after that time, the Registrar will no longer validate the id or hash. So, for efficiency, if Dinesh reads the message after that time, his software will not attempt to, say, make a hash and ask the Registrar about it.
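The efficiency check described above might be sketched as follows. The field names, the use of Unix timestamps for the expiration time, and SHA-256 are assumptions for the example; the text only specifies a hash (or id) plus an expiration time in some standard format.

```python
import hashlib
import time

def should_query_registrar(message: bytes, tag_hash: str,
                           expires_at: float, now=None) -> bool:
    """Decide whether Dinesh's mail reader should bother asking the
    Registrar about this message's hash."""
    now = time.time() if now is None else now
    if now > expires_at:
        return False  # past expiration: the Registrar will no longer validate it
    # Only query if the locally computed hash matches the one in the message.
    return hashlib.sha256(message).hexdigest() == tag_hash

msg = b"From: jane\n\nHello Dinesh"
h = hashlib.sha256(msg).hexdigest()
assert should_query_registrar(msg, h, expires_at=2_000_000_000, now=1_000_000_000)
assert not should_query_registrar(msg, h, expires_at=1_000, now=2_000)  # expired
```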
  • a phishing message comes from outside the company or from inside the company.
  • a message server at company.com gets a message from an external computer, that purports to be from company.com, then it should not be forwarded to its recipient. It should be discarded, or routed to a sysadmin for scrutiny, in order to detect a possible attack. But perhaps the company does not or will not check for this. It might have a policy of permitting incoming messages to claim to be from company.com. In general, this is undesirable, but suppose this is the situation. Then we effectively have the next case.
  • ape.b53.com has the base domain b53.com
  • g3.h00.co.uk has the base domain h00.co.uk.
  • Most medium sized or large companies might extend their base domain into subdomains. So imagine that company.com has it.company.com, hr.company.com, sales.company.com and some other domains. It can make several internal PLs, each corresponding to a subdomain. For example, it.company.com could have the PL {it.company.com, comp1.com, tech2.com}, and hr.company.com could have the PL {hr.company.com}. The latter means that a valid message from hr.company.com can only have links to that domain (or possibly its subdomains).
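The per-subdomain Partner List check could be sketched as follows. The PL contents mirror the example above, while the regular expression and helper names are hypothetical simplifications.

```python
import re

# Example Partner List for it.company.com, as in the text above.
PL = {"it.company.com", "comp1.com", "tech2.com"}


def link_domains(body):
    # Extract the host part of each http/https link in the message body.
    return {m.group(1).lower()
            for m in re.finditer(r"https?://([^/\s\"'>]+)", body)}


def in_partner_list(domain, pl):
    # A link is allowed if it points to a PL domain or one of its subdomains.
    return any(domain == d or domain.endswith("." + d) for d in pl)


def message_valid(body, pl):
    # A valid message may only link to domains covered by its PL.
    return all(in_partner_list(d, pl) for d in link_domains(body))
```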
  • the program that Mark uses to write his message can insert a Notphish tag ["2458"] that has an attribute.
  • The recipient's plug-in (or the company's mail server) sees this attribute and then finds the hash of the message. It queries the above server, presenting the hash, and asking if a message from hr has that hash. If so, then the message is considered valid. If not, then it is bogus. It did not come from Mark.
  • Amy sends a message claiming to be from joe@company.com. It goes to alice@company.com.
  • the message text pretends to be a request from Human Resources, and contains links to hr.company.com, for verisimilitude. But it also has a link to a computer that Amy controls. If the link goes outside the company, then our standard use of a PL can detect this and invalidate the message. But suppose that link goes to desk15.company.com? We apply the concept of the Restricted List (RL) from "2640". Imagine that the company defines its RL to include hr.company.com.
  • If the plug-in detects that a message has any links to the latter, and links to domains not in the RL (like desk15.company.com), it might classify the message as "MildWarning", which is not as severe as invalid. Plus, the plug-in might visually indicate in some fashion the latter link.
  • “2640” we used the RL primarily for users who receive mail at addresses outside company.com, where the mail purports to be from company.com. It can be seen that here, we are essentially applying the same RL internally.
  • Another merit of our method is that it is a means of detecting a company computer that has been subverted by a virus, if that virus releases spear phishing messages that refer back to the computer. If a virus is present, it could also perform other types of attacks on the company. So having an extra means of detecting it is useful.
  • Our method considered as a virus detection technique, is entirely separate from, and complements, conventional antivirus methods. (Cf.
  • the browser plug-in that we described here and in the Antiphishing Provisionals can also have its functionality extended to combat a type of browser hijacking.
  • the latter can arise if the user gets a message with an attachment that ends up being run. Or possibly if a virus gets installed on the user's machine by whatever means.
  • the attachment or virus waits till the user completes her login at a bank's website. (It might also have key logging ability that records her keystrokes.)
  • the attachment has an internal record of various banks (or similar institutions), as well as information about how to recognize a successful login. Once the latter happens, the virus makes the browser session invisible and makes another window that the user uses. From the phisher's viewpoint, this circumvents any login mechanism like a hardware two factor. The virus is then able to empty the account.
  • the plug-in can implement several methods. It could indicate (warn) the user whenever a session is being made invisible. Or it could prevent such an action. (To be able to do this depends on the given browser.) More specifically, the plug-in might do this only when the user is at a domain in the plug-in's set of domains.
  • BankO with a (real) website, bankO.com.
  • Amy sets up a fake website that imitates bankO.com.
  • She then does various techniques to drive clients of BankO to her website, and once there, to enter in their personal data. So that she can drain their accounts or use that data to set up fake identities.
  • Wikipedia defines it as "the exploitation of a vulnerability in DNS server software that allows a cracker to acquire the domain name for a site, and to redirect, for instance, that web site's traffic to another web site.” (Cf. en.wikipedia.org/wiki/Pharming.)
  • Our meaning includes this, but we also include the more general case where, by whatever technical means, a phisher has set up a website (“pharm”) that pretends to be another website.
  • the MITM attacks might be mostly directed toward mobile computing.
  • Jane has a laptop, PDA, mobile phone or some other device capable of browsing the network.
  • the network is the Internet.
  • Jane takes her mobile device to a hot spot, like a cybercafe or a hotel, say, and then logs in to the network.
  • Her connection could be wireless or wired.
  • Amy might have perhaps done DNS poisoning of the hot spot machine that connects Jane to the network. This poisoning might map "bankO.com" away from the real Theta address to Amy's Rho.
  • Jane and BankO share a secret key, K, established at an earlier time.
  • Jane's machine might have a browser plug-in, say, that has, at that earlier time while she was connected to the network in a more secure context, recorded the mapping from bankO.com to Theta. Hence, a simple check by the plug-in of Theta against the Rho will reveal a difference and thus alert Jane to the fake website.
  • her plug-in computes the following hash - h(K+Phi).
  • h() designates a common hash function.
  • the plug-in can find Phi and it already knows K.
  • Phi is assigned to Jane's machine via some method like DHCP.
  • h(K+Phi) is passed to what is actually Amy.
  • BankO makes a login page, in which Jane or her plug-in has typed her username in the appropriate box.
  • Kappa cannot be usefully concealed from BankO. It is the sender address in the packets that go from Kappa to BankO. Granted, Amy can change that sender field to anything she wants. If it is changed to another machine which she runs, then this is effectively the same as leaving it unchanged as Kappa. Suppose she changes the field to that of a machine she does not run. Then she does not get any reply sent by BankO, and her MITM attack fails. Above, we used the notation "K+Phi". Here the plus sign means some function of two variables, K and Phi. It does not necessarily mean arithmetic addition or the appending of the bits representing Phi to those representing K. Though it could.
  • the method is very simple to implement. And the hashing is faster than a PKI cryptographic approach. Plus, unlike a two factor hardware device, it does not need an accurate hardware clock in order to generate one time passwords.
  • a variant on the above is that Jane or BankO might not want her to reveal even her regular username, on the principle of denying as much information as possible to an attacker. Instead, just as they established a common key at an earlier time, they might also have established a temporary username, for her to use when traveling.
  • Another variant is that at that earlier time, they might have established several keys or several temporary usernames.
  • For Amy, one thing she might then try to do is write malware that can control the assigning of a network address to Jane. This is more involved, and this extra intricacy in itself acts to protect Jane, because it makes it more unlikely that a phisher could accomplish this step. But suppose somehow Amy is able to do this. So that in the full network, she picks a Kappa from which she will contact BankO, pretending to be a Jane customer. And in the hot spot subnet, Amy hands out a "Kappa" address to be Jane's Phi.
  • Let T be a string or bit sequence to be sent from BankO to Jane.
  • BankO can send the doublet (T, h(T+K)), for example.
  • Amy reads T as plain text. But if she changes either item in the doublet or both items, then Jane can detect this. Because when Jane's plug-in gets the doublet, it takes the first argument, T, and independently computes h(T+K). If this is not the same as the second argument, then the message from BankO was tampered with. And she does not proceed further. So assume that Jane finds the message is untampered.
  • Her plug-in combines T with K in some manner to find the key for encrypting a (new) channel between her and BankO. Where this manner of combining is also performed by BankO. As above, the programmers of the plug-in and BankO ensure that they are using the same method to get a key, and, of course, the same subsequent encryption method that uses the key. In the above, when we said that BankO sends a doublet to Jane, the reverse might happen. Their abilities are symmetric, in this respect.
  • K which might conventionally be Jane's password.
  • Jane and BankO might have earlier predefined a separate password to be used in. this context.
  • a variant of the channel encryption method involves a different choice of K.
  • Y(t), where Y is a function of the time, t.
  • Jane and BankO have separate implementations of a stable clock, that were synchronized at some earlier time. And that the drift in the clocks over the duration that they will be used is small enough that both parties will generate the same Y(t). More generally, Y might be a function of both K and t.
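One way to realize a time-varying Y(t), assuming a 30-second interval and SHA-256 (both arbitrary choices for illustration), is to derive the value from the shared secret K and the current time step, so that two parties with synchronized clocks compute the same Y(t):

```python
import hashlib


def Y(K: bytes, t: float, step: int = 30) -> str:
    # Both parties bucket the time into fixed intervals; within one
    # interval (and with small clock drift) they derive the same key.
    interval = int(t // step)
    return hashlib.sha256(K + str(interval).encode()).hexdigest()
```

More generally, as noted above, Y might be a function of both K and t, which this sketch reflects.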
  • Another situation is where Jane is a person, but instead of her carrying a mobile computer around, she might instead carry a fob that can plug in to a computer at a cybercafe or library.
  • the fob might have what we have referred to above as a plug-in to a browser.
  • our channel encryption method can be used in a situation where two parties have a shared secret and wish to defend against a MITM attack.
  • the offset corresponds to the T in the doublet (T, h(T+K)) used above.
  • The communication between the ISPs could be in plain text, because an eavesdropper gains very little useful information, if any. And to the extent that she finds information, there is little direct financial advantage to be gained from it, if any. So in "0046", a MITM attack has minor significance.
  • ISPs may be reasonably expected to have the personnel and resources to guard against such malware attacks as DNS poisoning. And ISPs are often closer to the network backbone, which is more heavily defended against attacks than a hot spot on the periphery of the network.
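The doublet (T, h(T+K)) exchange described above can be sketched as follows, taking "+" to be byte concatenation, which is one of the possible combining functions mentioned; SHA-256 stands in for the common hash function h().

```python
import hashlib


def h(data: bytes) -> str:
    # A common hash function; SHA-256 is an illustrative choice.
    return hashlib.sha256(data).hexdigest()


def make_doublet(T: bytes, K: bytes):
    # BankO (or Jane; their abilities are symmetric) sends (T, h(T+K)).
    return (T, h(T + K))


def verify_doublet(doublet, K: bytes) -> bool:
    # The receiver's plug-in recomputes h(T+K) from the first argument
    # and compares it against the second. A mismatch means tampering.
    T, tag = doublet
    return h(T + K) == tag
```

Amy can read T in transit, but cannot alter either item without the check failing, since she does not know K.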

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Information Transfer Between Computers (AREA)

Description

Systems and Methods of Enhanced e-Commerce, Virus Detection and Antiphishing
CROSS-REFERENCES TO RELATED APPLICATIONS
This application claims the benefit of the filing date of U.S. Provisional Application, Number 60/595804, "System and Method for an Anti-Phishing Plug-in to Aid e-Commerce", filed August 7, 2005, and which is incorporated by reference in its entirety. It also incorporates by reference in its entirety the U.S. Provisional Application, Number 60/595808, "System and Method of Using Tags Against Viruses", filed on August 7, 2005, and likewise for the U.S. Provisional Application, Number 60/595809, "System and Method of Anti Spear Phishing and Anti-Pharming", filed on August 7, 2005.
TECHNICAL FIELD
This invention relates generally to information delivery and management in a computer network. More particularly, the invention relates to techniques for protecting users against computer viruses, phishing and pharming.
BACKGROUND OF THE INVENTION
e-Commerce can be threatened by a phisher who sets up a website, often called a pharm, that might pretend to sell some good or service. The pharm induces the browsing visitor to make a purchase, often by entering financial details like a credit card number. The pharm might have images of seals from various electronic assurance companies, to persuade the visitor that the site is real. But these seals might be bogus, since images of real seals can be trivially copied on the Web.
Another serious and continuing problem in computer usage is viruses and worms. (In this Invention, we shall use the term "virus" to include also "worm".) These propagate by various means. Often, via email or by a user downloading a file from a website. In both cases, typically there might be comments in the email or on a web page from which the file was downloaded, that claims the attachment or file is benign or that it will do some given task. Whereas, once on the user's computer, it actually does a subversive task.
Usually, a virus is in binary form. Though it might also exist in a higher level format, like ASCII, where perhaps it attempts to be compiled on the user's computer, by some means. Also, viruses exist in every scripting language, like Perl or PHP, where these scripts are written in ASCII. Plus, common proprietary formats like Postscript or PDF (both from Adobe Corp.) are effectively computer languages. Viruses exist in these languages. There have been many attempts to detect viruses. A comprehensive recent description is in "Virus Research and Defense" by Szor (Addison-Wesley 2005). Most of the current methods involve researchers at an anti-virus firm obtaining suspected files, and then doing intensive manual analysis. The results of this analysis are then propagated to customers of the firm, usually in the form of virus "signatures". The customers then have software furnished by the firm, which uses this data to try to match against new, incoming files. Sometimes, a signature might be a checksum or hash of a known virus. But polymorphic viruses have been written in response. These can change their instantiations, resulting in different checksums or hashes. Which renders a simple hash signature useless. The anti-virus firm then has to investigate each polymorphic virus, for some invariant internal structure that can be used as an effective signature. Quite apart from the often difficult manual nature of this investigation, it often has to be repeated for each new type of polymorphic virus that is detected. The problem is ongoing. The above solutions need to be continually found.
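A toy demonstration of why a simple hash signature fails against even a trivially mutated sample. The payload bytes here are made up, and MD5 stands in for whatever checksum the anti-virus firm publishes.

```python
import hashlib

# A hypothetical virus sample, and a polymorphic variant with the same
# behavior but a slightly different instantiation (extra padding byte).
original = b"\x90\x90\xebSAMPLE-PAYLOAD"
mutated = b"\x90\x90\x90\xebSAMPLE-PAYLOAD"

# The published "signature" is just a hash of the known sample.
sig = hashlib.md5(original).hexdigest()


def matches_signature(sample: bytes) -> bool:
    # Any change to the file, however behavior-preserving, changes
    # the hash and defeats this kind of signature.
    return hashlib.md5(sample).hexdigest() == sig
```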
And the anti-virus efforts are always reactive. The virus writers move first. The problem is fundamentally hard. Given an arbitrary file, what does it do or can it do if it is run (executed)? It is more difficult than simply running the file. In general, a single run of a file does not necessarily test all its functionality. A virus could have conditions or triggers that it looks for, and which can be hard to discern. As phishing has continued to plague the Internet, a pernicious version has arisen. Traditional phishing involved the sending of millions of messages, typically purporting to be from a large financial institution, or perhaps eBay Corporation. The messages would try to trick the recipients into submitting their personal information (like a username and password and Tax Id) into a website run by the phisher. Some simple, and somewhat ineffective, antiphishing methods would try to detect these messages, in part by the sheer multiplicity of the messages.
Let Amy be a phisher. The new phishing version involves the phisher targeting a specific company.
Amy finds the email addresses of several users at the company. This might be done by simply perusing the company's publicly viewable web pages. Or perhaps by asking a search engine for results (web pages) containing the string "@company.com", where here we imagine that the company has a domain "company.com". By whatever means, Amy amasses these email addresses. If she cannot find these for a given company, then she goes on to target another company.
Amy then sends messages to those users. Typically, these messages might claim to be from the company's personnel or IT department. Often, there is a link to an external website run by the phisher. If the unwary user clicks on it, then the user is shown a web page where he is meant to enter various personal data. Sometimes this might just be his company password. Because then the phisher can use this and the user's username, to login and do other attacks. Or, the user might be asked to furnish other types of personal data, that might let Amy impersonate the user in contexts external to the company.
This type of attack is sometimes called "spear phishing". Because only a few users are targeted at a time. The low multiplicity of the messages can make it hard for some antiphishing methods to find them.
It is also potentially more dangerous. Because the phisher has time to craft a message that is very specific to the target company. And perhaps even to the users, if Amy can amass enough contextual information (through a search engine, say) about the users and their jobs. Plus, by using a Sender address that supposedly is inside company.com, it increases the odds that a user will regard it as real.
Spear phishing also tries to avoid another antiphishing method. This involves user education, where a user is warned not to divulge personal data in response to a message claiming to be from that user's bank or credit card company or similar institutions. But when a user at work gets a message claiming to be from someone else in the same company, often her guard might be lowered.
Typically, in many companies, an employee might regard it as part of her job to respond to a plausible request from another employee. An even more dangerous version of spear phishing occurs when Amy has, by whatever means, successfully introduced a virus into one of the company's computers. Where this virus is able to send messages from within company.com.
Plus, instead of having a link to an external domain run by Amy, it has a link back to the virus's computer. The virus is also assumed to be able to run a web server. Then, a recipient of a message from the virus has a much harder time discerning its bona fides. Here, the virus is able to take any users' replies and then relay them to a destination outside the company.
SUMMARY OF THE INVENTION
The foregoing has outlined some of the more pertinent objects and features of the present invention. These objects and features should be construed to be merely illustrative of some of the more prominent features and applications of the invention. Other beneficial results can be achieved by using the disclosed invention in a different manner or changing the invention as will be described. Thus, other objects and a fuller understanding of the invention may be had by referring to the following detailed description of the Preferred Embodiment.
We show how an Aggregation Center (Agg) and its browser plug-ins can directly take part in e-commerce, when a user visits a website and makes a transaction. The plug-in can validate the user and the website to each other, making a transaction more likely. And the transaction details can be handled at the Agg, which can act akin to a clearinghouse. The website essentially outsources the transaction processing to the Agg. This can also enable a greater usage of micropayments.
We can detect fake electronic seals, used by a phisher in messages or pharms, to mislead the visitor into thinking that a validating organization has approved the message or website. The Agg gets various data from the validators, and passes these to its plug-ins. A plug-in can apply deterministic tests to websites and electronic messages, to test the validity of such seals.
We extend the tags and the Agg to combat viruses. By putting a tag into a file released by a company. The tag points to the company. The user has a program that checks the file, by finding a hash, and submitting the company name and hash to the Agg. The company has previously sent the Agg a list of hashes of its published data. Hence the Agg can tell the user's program whether the file came from the company or not. Our method can also detect unauthorized changes to a company's published files. Where these changes may not necessarily be due to a virus.
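The file check summarized here might look like the following sketch, where the Agg's records and the file contents are hypothetical: the company pre-registers hashes of its published files, and the user's program recomputes a hash and asks the Agg whether it matches.

```python
import hashlib

# Hypothetical Agg records: per company, the set of hashes the company
# previously registered for its published files.
agg_records = {
    "company.com": {hashlib.sha256(b"official installer bytes").hexdigest()},
}


def file_is_authentic(company: str, file_bytes: bytes) -> bool:
    # The user's program hashes the received file and checks it against
    # the hashes the named company registered with the Agg.
    digest = hashlib.sha256(file_bytes).hexdigest()
    return digest in agg_records.get(company, set())
```

This also flags unauthorized changes to a company's published files, whether or not a virus caused them.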
We attack spear phishing. The latter is a targeted attack against employees of a company, by sending messages pretending to be from someone else in the company. These messages often have a link to an external website controlled by the phisher. Our Partner Lists and a browser plug-in can detect unauthorized links. We can also handle a virus infecting a company computer and then sending messages linking back to that computer.
We attack a phisher using a pharm in a Man In The Middle attack. We detect this to high probability. A bank customer logging in remotely presents her username and a hash of her password and current network address. The bank computes a hash of that password and the network address from which she is supposedly coming from. If the hashes differ, it suggests a MITM attack.
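The login check summarized here can be sketched as follows; the function names, and SHA-256 with byte concatenation for h(K+Phi), are assumptions for illustration.

```python
import hashlib


def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def client_proof(K: bytes, phi: str) -> str:
    # Jane's plug-in hashes her secret K with her own network address Phi.
    return h(K + phi.encode())


def bank_check(K: bytes, kappa: str, proof: str) -> bool:
    # BankO recomputes the hash using the address (Kappa) the packets
    # actually came from. Under a MITM, Kappa differs from Phi, so the
    # hashes disagree and the attack is suggested to high probability.
    return h(K + kappa.encode()) == proof
```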
BRIEF DESCRIPTION OF THE DRAWINGS
There is one drawing. It shows a user at a PC, browsing at a website, with a browser plug-in that communicates with an Aggregation Center.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
What we claim as new and desire to secure by letters patent is set forth in the following claims.
We described a lightweight means of detecting phishing in electronic messages, or detecting fraudulent web sites in these earlier U.S. Provisionals: Number 60522245 ("2245"), "System and Method to Detect Phishing and Verify Electronic Advertising", filed September 7, 2004; Number 60522458 ("2458"), "System and Method for Enhanced Detection of Phishing", filed October 4, 2004; Number 60552528 ("2528"), "System and Method for Finding Message Bodies in Web-Displayed Messaging", filed October 11, 2004; Number 60552640 ("2640"), "System and Method for Investigating Phishing Websites", filed October 22, 2004; Number 60552644 ("2644"), "System and Method for Detecting Phishing Messages in Sparse Data Communications", filed October 24, 2004; Number 60593114, "System and Method of Blocking Pornographic Websites and Content", filed December 12, 2004; Number 60593115, "System and Method for Attacking Malware in Electronic Messages", filed December 12, 2004; Number 60593186, "System and Method for Making a Validated Search Engine", filed December 18, 2004; Number 60/593877 ("3877"), "System and Method for Improving Multiple Two Factor Usage", filed February 21, 2005; Number 60/593878 ("3878"), "System and Method for Registered and Authenticated Electronic Messages", filed February 21, 2005; Number 60/593879 ("3879"), "System and Method of Mobile Anti-Pharming", filed February 21, 2005; Number 60/594043 ("4043"), "System and Method for Upgrading an Anonymizer for Mobile Anti-Pharming", filed March 7, 2005; Number 60/594051 ("4051"), "System and Method for Using a Browser Plug-in to Combat Click Fraud", filed March 7, 2005.
We will refer to these collectively as the "Antiphishing Provisionals". We will also reference these U.S. Provisionals submitted by us: Number 60/320046 ("0046"), "System and Method for the Classification of Electronic Communications", filed March 24, 2003; Number 60/521,698 ("1698"), "System and Method Relating to Dynamically Constructed Addresses in Electronic Messages", filed June 20, 2004; Number 60/521,174 ("1174"), "System and Method for Finding and Using Styles in Electronic Communications", filed March 3, 2004; Number 60/481,745 ("1745") "System and Method for the Algorithmic Categorization and Grouping of Electronic Communications", filed December 5, 2003.
Our Invention has three main sections - using the antiphishing plug-in for e-commerce, using tags against viruses, and enhanced anti-spear phishing and anti-pharming. A common, unifying theme is the central role played by an
Aggregation Center (Agg), often in conjunction with a browser plug-in that can get information from the Agg.
1. Using the Antiphishing Plug-in for e-Commerce
In the Antiphishing Provisionals, we described how to use an Agg with a browser plug-in, for a deterministic detection of phishing messages and websites (pharms) run by phishers. Our earlier Provisionals deliberately shied away from involving the plug-in or Agg in an actual transaction. In this method, we show how the combination of the plug-in and Agg can take part in such a transaction. Moreover, we show how this ability might itself enable transactions that would not otherwise have occurred. Consider a user, Jane, with a browser and plug-in. Imagine a website, Nu, that she visits. In general at that point, neither party knows anything about the other. For one, even if Jane has visited Nu before, how can she know that this current website is actually the true Nu? It might be a pharm. To this end, the Antiphishing Provisionals had many methods, most crucially involving Partner Lists, that let the plug-in validate or invalidate Nu.
The plug-in can also present information to Nu about Jane. Nu might expose an Application Programming Interface (API) or Web Service, to be used by the plug-in, if the latter exists on the browser. Nu might perform some validation procedure with the plug-in and possibly the Agg, such that Nu has confidence that the plug- in is itself valid. This procedure might involve some type of zero knowledge protocol. Where, for example, Nu might ask the plug-in for some information about the plug-in, which Nu then sends to Agg for confirmation. Typically, Nu could assume that Agg, which is located on the network, is a valid authority. Depending on Agg's reply, other steps might be forthcoming.
We assume that Nu now regards the plug-in as valid. It can then ask the plug-in for some information on the user which is accessing its website. Thus, the plug-in (and Agg) vouch for Jane.
The information about Jane that is sent to Nu can be controlled by her in some simple fashion. For example, at an earlier point in time, she might bring up the browser, login to the plug-in and then set this information. The actual information sent to Nu is some combination of this and other information from the plug-in. The idea is that the plug-in and Agg offer information to Nu that preserves Jane's privacy, while still furnishing reliable data on her. For example, the plug-in might send some type of credit assessment of Jane. Or some spending limit that it and the Agg can validate, and which Jane is willing to tell another website, in this fashion.
Here, we extend the functionality of the Agg from that described in the Antiphishing Provisionals. It stores electronically monetary accounts of several (or many) of its users; specifically including Jane, in our examples. Users can access these accounts by various means. And the Agg might also have interactions with established financial networks, like that used for Electronic Funds Transfer.
An alternative implementation is that these accounts are handled by a separate, established financial entity, like a bank. Where the Agg essentially plays a passive role in the transactions, by relaying such data between a plug-in and the bank. For simplicity in what follows, we shall assume that the Agg directly maintains such accounts, though this other implementation is always possible.
The information about Jane that is passed to Nu should be written in some standard structured form. Possibly in XML. If so, then perhaps in some extension to one or more of the Electronic Commerce Markup Language (ECML), the Web Service Description Language (WSDL) or the Business Process Execution Language (BPEL). The passing of such structured information from Jane's computer to Nu should not be confused with the functionality of electronic wallets. These are programs that reside on a user's computer, and are often little more than convenient form fillers. That is, the user might enter various information about herself, and these are later passed to websites, saving her the manual retyping. But typically, there is little validation of such information. For a website getting information from a wallet, there is usually no increase in the reliability of such information.
Based on the validated information from the plug-in, Nu can make several decisions as to what it shows Jane. Notice that this does not require Jane to be an existing member of Nu's website, and to have logged into Nu. Thus the plug- in (and Agg) lets Jane have more access to the web, without the tedium and error prone steps of having to become a member of various websites. And without losing some privacy to those websites by entering information about herself. Nu can decide to only show some products, based on Jane's spending limit or credit assessment. Often, many commercial websites use web pages that are dynamically generated. So logic could go before the generation, that uses the validated information.
Assume that Nu has an account at Agg that can receive electronic payments. Typically, for Nu's safety, this account is configured so that it is a "push" account. That is, anyone can send money electronically to it, but only Nu can withdraw money. Hence, Nu can publicize the account.
It also follows that the Agg can also ask each of its corporate customers for a list of its valid push accounts. Typically, these are made public. Since one attack is for a message or pharm that pretends to be from or of Nu to then give a false push account address. To the extent that such addresses can be extracted programmatically from a message or web page, the plug-in can extend its efficacy. This also has merit in helping Nu guard against a subversive attack. Where one of its employees is a phisher, who then changes a web page and gives her own account address, or does likewise in outgoing messages. It assumes that the actual valid push addresses have not been compromised when they were sent to the Agg.
Suppose on a page, Jane decides to buy an item, and presses a button on that page to do this. Nu can then contact the plug-in or the Agg directly, to perform the funds transfer. Such communication might be by some encrypted means like https. This can be far faster than alternative means, like Jane paying via PayPal™ or by going to a bank's website, where she has an account at that bank, logging in, and then sending funds via wire transfer or some other electronic means to Nu. All of these methods involve many manual steps. By contrast, using the plug-in or Agg might involve only a simple confirmation. Suppose the plug-in contacts the Agg to move funds between Jane and another entity, and the method of "4051" is being used to credit a search engine that had a link to that entity, which Jane clicked on, and thus eventually led to this transaction. Here, there is no programmatic ambiguity to the plug-in about whether a transaction occurred. Hence, using "4051" it can contact the search engine, informing it that it is entitled to a commission from the transaction.
1.1 Micropayments
The above discussion grew out of the desire for the user and website to have some means of getting valid information about the other. Hence we showed how the plug-in and Agg can be used to do this. Here we show how the plug-in and Agg can enable a much broader use of micropayments. There has been much work on implementations of the latter. But numerous problems have emerged. The first revolves around the size of such a payment. Often a micropayment is considered to be some amount between one U.S. cent and one U.S. dollar. But it is impractical to expect a user to spend more than a minimal amount of time entering in data about herself, to pay such an amount. (Though to some extent, a standard electronic wallet might reduce such a time.) There is also the consideration that the financial payment organization might impose a fee, possibly a fixed size fee, that makes the transaction uneconomic to the payee (recipient). These reasons take on even more force if a transaction is a fraction of a cent.
Another problem is having the website maintain elaborate financial software to handle such micropayments, as well as expecting users to do likewise on their machines.
The Agg addresses the latter problem, by centralizing this software on itself.
The plug-in addresses the former problem. It can present numerous options to Jane that let it pay for such transactions without any manual involvement by her, except perhaps in an exception mode. For example, Jane can set several of these values:
1. Maximum amount for which the plug-in can automatically pay, per transaction.
2. Ditto, per website, per day.
3. Ditto, for all websites visited, per day.
These constraints could then be applied.
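The constraint checking above might be sketched as follows. This is a minimal illustration only; the class name, limit values and site names are our own assumptions, not part of the specification.

```python
from collections import defaultdict

class MicropaymentPolicy:
    """Illustrative sketch of the spending limits Jane might set.
    All names and limit values here are assumptions."""

    def __init__(self, max_per_txn, max_per_site_day, max_total_day):
        self.max_per_txn = max_per_txn            # item 1 above
        self.max_per_site_day = max_per_site_day  # item 2
        self.max_total_day = max_total_day        # item 3
        self.site_spent = defaultdict(float)      # spent today, per site
        self.total_spent = 0.0                    # spent today, all sites

    def can_auto_pay(self, site, amount):
        """True if the plug-in may pay without asking Jane."""
        return (amount <= self.max_per_txn
                and self.site_spent[site] + amount <= self.max_per_site_day
                and self.total_spent + amount <= self.max_total_day)

    def record(self, site, amount):
        self.site_spent[site] += amount
        self.total_spent += amount

policy = MicropaymentPolicy(max_per_txn=0.50, max_per_site_day=2.00, max_total_day=5.00)
print(policy.can_auto_pay("news.example", 0.25))  # True: within all three limits
policy.record("news.example", 0.25)
```

Any payment falling outside these limits would go into the exception mode mentioned above, requiring a manual confirmation from Jane.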
1.2 Validating Electronic Seals
As websites have been deployed for e-commerce, many have sought to reassure visitors about their bona fides. To this end, other companies have offered validation services in conjunction with electronic seals. The latter companies include Truste Corporation, Verisign Corporation and the Better
Business Bureau. Typically, such an organization, which we term a "validator", would conduct a review of a corporate customer, where this review might be of financial or other matters. If the customer passes muster, then, amongst other services, it is allowed to post an electronic seal on its web pages. Typically, this seal includes some type of image, though it could also be an animation (video), possibly with audio. The image refers to the validator, and attests that the website is reputable. Also, the seal is often clickable, so that the user can find out more about what the seal's accreditation means and about the validator itself.
The validator might also let its customer display the seal in electronic messages sent by the customer, where these messages might be written in some markup language that has hyperlinks (like HTML). So if the message's recipient views it in a program that can show the markup language, like a browser, then she can also get reassurance from that seal, and click on it for more information. However, as phishing and pharming have increased, some shameless phishers have written seals into their messages and pharms, without any permission from the validators. There are various degrees to which these seals are faked by the phishers. Here, we delineate those degrees and explain how our antiphishing plug-in can implement countermeasures.
We consider the simplest case, where the seal is an image (possibly animated). Later we generalize this.
We define some terminology. The Agg can have validators as customers, in the specific role of validator, where they give the Agg information which will be described below. In turn, the Agg can periodically download such validator lists and their associated data to a plug-in, in the same way as it does for regular customers and their Partner Lists and other data ["2245", "2458"]. This new type of customer extends the capabilities of the Agg and plug-in. It is also then a possible extra revenue source for the Agg, inasmuch as the Agg and its plug-ins help protect the value of a validator's seal.
Let V = set of the Agg's customers who are validators. If the browser is looking at a web page, let D = base domain of that page, where the base domain was described in "0046", and essentially is the minimal rightmost set of fields within a domain that can be purchased. But if the browser is showing a message, and the message has one of our Notphish tags ["2458"], then from it, we get D = base domain of the (purported) company that sent the message. Otherwise, we can take the sender field of the message and let D be the base domain from that field.
Suppose the seal is loaded from a base domain B. In pseudocode, we can do the following -

if ( B in V )
{
    if ( D is not a customer of B ) invalid();
    else if ( message AND B does not allow seal in customer message ) invalid();
    else if ( website AND B does not allow seal in customer website ) invalid();
    else if ( image not clickable AND B's seal must be clickable ) invalid();
    else if ( image clickable )
    {
        if ( B's seal must not be clickable ) invalid();
        // here we also include the case where B's seal link must go to a domain
        // that has a base domain = B,
        // and the case where the link does not go to that domain
        else if ( link does not go to B AND B's seal link must go to B ) invalid();
        else if ( link goes to unknown URL in B ) invalid();
    }
    else do possibly other tests related to the seal;
}
else do other tests in the Antiphishing Provisional;
In the above, the invalid() refers to various steps, along the lines of those described in ["2245", "2458"], to be done when a web page or message is invalidated by the plug-in. The order of the above tests has no particular meaning. In a given implementation, one might choose a specific order to optimize performance, in part because some of the above tests are unlikely to often be true, but are listed for conceptual completeness.
From the above, it can be seen that the Agg obtains from each of its validators information that includes, but is not limited to, the following:
1. List of customers that have a right to display the seal.
2. Can a seal be shown in a customer's website?
3. Can a seal be shown in a message from a customer?
4. Is a seal clickable?
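As a minimal sketch, the per-validator data above and the earlier pseudocode might be combined as follows. The field and function names are illustrative assumptions, and the per-customer generalizations are omitted:

```python
from dataclasses import dataclass

@dataclass
class ValidatorPolicy:
    """Hypothetical per-validator data held by the Agg (fields map to
    items 1-4 above; names are our own illustration)."""
    customers: set                 # base domains allowed to display the seal
    seal_in_website: bool = True   # may the seal appear on a customer website?
    seal_in_message: bool = True   # may it appear in a customer message?
    seal_clickable: bool = True    # must the seal be clickable?

def check_seal(policies, B, D, in_message, clickable):
    """Return False (invalid) if the seal loaded from base domain B, shown
    for sender/page base domain D, violates validator B's policy."""
    if B not in policies:
        return True    # B is not a validator; other tests would apply
    p = policies[B]
    if D not in p.customers:
        return False
    if in_message and not p.seal_in_message:
        return False
    if not in_message and not p.seal_in_website:
        return False
    if clickable != p.seal_clickable:
        return False
    return True

policies = {"truste.example": ValidatorPolicy(customers={"bank.example"})}
print(check_seal(policies, "truste.example", "phish.example", True, True))  # False
```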
The first item above deserves comment. In general, for an arbitrary company, a list of its customers is kept proprietary, both for the company's competitive advantage, and because often the customers might not want it publicly known that they are customers of that company. But for the case of a validator, the essence of an issued seal is that it is publicly viewable, for the benefit of both the validator and the customer that publishes the seal. Hence, there should be little problem with the Agg getting such a list, and changes to it, from a validator. Here, there is even more reason to publicize such a list than in the case of a generic Agg customer, who tells the Agg its Partner Lists (and other types of information).
The list of information from a validator assumes that for all of its customers, the items after the first item apply equally. That is, if a validator says that a seal cannot be shown in a customer's message, then this applies to all the validator's customers. The pseudocode also assumes this. But clearly, all the items after the first item can be generalized to be functions of a customer. (Vectors instead of scalars.) In this case, there would be the obvious generalizations of the tests in the pseudocode. For example, consider the test that currently states "if ( image not clickable AND B's seal must be clickable )".
This would be rewritten as
" if ( image not clickable AND B's seal must be clickable for customer D )".
Currently, a validator B might have an antiphishing measure that consists of requiring that its seal be loaded from its website. Hence, B's web server records the network address of whoever is downloading a seal. B can then send out spiders to these addresses, or peruse them manually, to try to detect unauthorized usage. Note that this fails if a phisher sends out messages with a seal, where the seal is loaded from B whenever a reader views such a message. In this case, all B knows is that a message at a given message provider has a seal. In general, B does not have access to the text of that message or the links (if any) in it. Whereas our method applies equally well to both messages and websites.
This common, current antiphishing method performed by the validator might be avoided by a phisher. She loads the seal from a location with a base domain B that is not in V. Thus, the plug-in can perform the following pseudocode:

if ( B not in V )
{
    if ( image clickable AND it goes to a base domain Q in V )
    {
        if ( Q's seal must be loaded from Q ) invalid();
        else if ( D not Q's customer ) invalid();
    }
    do possibly other tests related to the seal;
}
Now it is possible for the validator Q to keep a record of the addresses of nodes that go to Q, and of the URLs they present, where these URLs point to locations in Q. But in this case, those nodes are browsers, and unlikely to yield information about any phishing. The presented URLs do potentially have more information, under some limited circumstances. Suppose an URL is valid, as determined by Q. But Q may be unable to tell if the query came from a user at the customer's website, at a pharm imitating the customer, or in a message purporting to be from the customer.

But the biggest drawback for Q is simply that the seal is not loaded from its website. Many users do not usually click on a seal. They simply treat its visual presence as sufficient validation. So Q only gets any information when a fraction of users actually clicks on its seal. This gives another advantage to our plug-in, over anything Q can try, because the above pseudocode runs regardless of whether the user clicks on the seal or not. So it has a greater chance of detecting phishing.

Another advantage of the plug-in is that the above pseudocode runs first, before a user who might actually click on a seal does so. In any given instance, this time advantage might seem small. But if the plug-in can detect phishing, it can invalidate the message or web page. This is greater safety for the user: the plug-in turns off suspect links, so that the user cannot click on these, for example. This is far more immediate safety than even that enjoyed by the minority of users who will click on the seal and perhaps (and certainly not necessarily) be told by Q that the seal is bad.

Let us assume that our plug-in is not present, and the latter has occurred. So the user is alerted about a possible danger. But that alert from Q cannot affect the message or page that the user clicked from. That message or page is still dangerous to the user.
For example, it might have a form where she was tricked into entering her personal data. The form has a button to upload the data to another computer run by the phisher. Suppose the user had filled out the form and then clicked on the seal, to double check. Even knowing from Q that the message or page is bad, she might still inadvertently click that submit button. Whereas our plug-in can deterministically detect phishing, and invalidate the message or page, before the user has a chance to enter her personal data.
A variation on the above is where the phisher uses scripting to generate dynamic addresses for loading images, or for (outgoing) links. She is attempting to conceal where she is loading the seal from, or where it points to. We can use our antispam methods of "1698" to detect her attempts.
The phisher might try to load a seal from a location not in V, and have it clickable, but not to any domain in V. If the browser is viewing a message with the seal, the latter might be an image which is loaded from an attachment in the message. The plug-in might run image recognition algorithms against such loaded images, to possibly identify close matches against images of its validators' seals. Several heuristics are possible. For example, most seals are fairly small, because they are meant to take up only a fraction of a page or message. The plug-in can have a maximum size, found over all its validators' seals. So it can ignore images in a message or page outside those dimensions. This is an important computational saving. Plus, the images it does look up are then small, and would be quicker to analyze. These remarks also apply if the seal is not clickable.
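The size heuristic just described might look like the following sketch, where the maximum seal dimensions and the image names are purely illustrative:

```python
# Pre-filter: skip any image larger than the biggest validator seal, so
# only plausible candidates reach the (expensive) image recognition step.
# These maxima are illustrative assumptions, not real seal dimensions.
MAX_SEAL_W, MAX_SEAL_H = 140, 80

def candidate_seals(images):
    """images: iterable of (width, height, name); keep only plausible seals."""
    return [img for img in images
            if img[0] <= MAX_SEAL_W and img[1] <= MAX_SEAL_H]

imgs = [(120, 60, "logo_a"), (800, 600, "photo"), (100, 50, "logo_b")]
print([name for _, _, name in candidate_seals(imgs)])  # ['logo_a', 'logo_b']
```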
Of course, if a seal that visually purports to be from some validator, B, does not click through to B's domain, or is not even clickable, when it is meant to be, then this increases the phisher's chances of being manually detected by the user.
There are elaborations on the above, in terms of what a validator might do. A validator, V, might require its customer to insert a special piece of HTML, and possibly scripting code, into the customer's page. So, if there is a script, then when the page is loaded into a browser, the script might execute (if the user enables scripting), get some information about the browser and the computer, and then show the seal's image. This might involve the script going back to V's website and downloading the image.
More intricate steps might also be imagined. The problem is that these are ultimately futile in preventing the use of a fake seal. The steps that V tells its customers to do will, in general, be followed by them. But Amy, a phisher, has no need to do so. Most users do not know, or care, if a correct seal involves, say, the scripting steps of the previous paragraph. All they expect to see is a familiar image, that is often clickable. Amy can take advantage of this, by going to a customer's page, with a seal image. She can then use screen capture methods on her computer and image editing tools like Photoshop™ (from Adobe Corporation), to get an exact copy of the image. She can then store this on her website, and load it from there, into her web pages or messages.
Under some circumstances, Amy might also use an image editor to deliberately produce a visually similar, but different, copy of a seal image, where she starts with an exact copy of the seal image. This has to do with trying to evade an image comparison method. Exact matches of two sets of data are easy to test for. So she can use the editor to introduce variations into her image that cannot be detected visually. For example, randomly toggling the low intensity bits that code for color. Plus, she might also change the x or y dimensions of the image, as well as make other changes inside the image. Our method refers to external image comparison methods. But it can be seen from this that quantifying the difference between two images can be subjective. Hence, while our other steps are deterministic, this comparison is qualitatively different.
1.3 ISP Methods

An ISP can also use the plug-in methods described above, to search for fake seals in its messages, both incoming and outgoing. Then, if it finds any, it could perhaps reject those messages, or mark them as suspect in some fashion, so that any such incoming messages might go into the recipient's bulk folder, for example. While the detection of fake seals in outgoing messages might cause it to scrutinize the senders. Perhaps to see if they are phishers, or if those accounts might have been hijacked by phishers. (Maybe the rightful owners of the accounts had their passwords discovered by phishers, by some means?)
It should be said that for the ISP to use the plug-in methods also strongly implies that the ISP has a relationship with the Agg, to get the necessary information about the Agg's validators. But the Agg is not strictly necessary to an ISP. It could have direct relationships with various validators, and get such information.
An ISP can do more steps to detect fake seals that a plug-in cannot perform. The toughest scenario to detect is where Amy, the phisher, uses an image of a seal that is not loaded from a validator, and which does not link to a validator, but where that image can be construed by a casual observer as a real seal. The ISP picks a random set of messages. Or, it might already be making Bulk Message Envelopes (BMEs), as part of the antispam measures of "0046". If it has BMEs, then it can pick randomly, out of the top 20%, say, of the BMEs, where we rank the BMEs by the number of messages in each BME.
If Amy is sending forth many messages, then statistically we are likely to have these amongst the chosen messages. And the more messages she sends, the more likely this is true.
For the chosen messages or BMEs, we apply the image recognition method described above. If we identify an image as similar (or identical) to a real seal, then we set a style "Fake Seal", where "1174" described how we defined other styles and use them to aid in antispam efforts. Here, we add this new style. This style might be implemented, not as a Boolean, as most of the styles in "1174" are, but as a string (character array). The value is the validator whose seal is being faked. So for validator V, we gather those messages or BMEs with its fake seal.
(Assume that we have found some.) We can find the domains in these messages or BMEs and use these as nucleation sites to make a domain cluster as in "1745". Plus, we can spider those domains for further investigation. Hence, the ISP can offer more protection to its users. Yet another variant is where Amy refrains from putting any seals, fake or real, into her messages. Instead, these might have links to her pharm. Those pages would contain fake seals. Here, the ISP can still randomly choose messages or BMEs, and then follow any links within, where it might apply an exclusion list of large companies, say, that it thinks unlikely (unless they have been subverted) to indulge in pharming. The ISP spiders those links and applies the fake seal detection methods described above to some subset of the spidered pages.
As before, statistically, the more messages Amy sends, the more likely the ISP's spiders are to investigate her website. Also, in practical terms, any fake seals are often likely to be on the page that is immediately linked to in her messages, since she wants to reassure a visitor of her "bona fides" as soon as possible upon visiting her website. So in terms of computational effort, the ISP might concentrate on these directly linked pages.

2. Using Tags Against Viruses

In our Antiphishing Provisional, we explained the usage of a <notphish> tag inserted into a message sent out by a company like a bank. The bank would then also send its Partner List (PL) to an Aggregation Center (Agg). Then, a user on her computer would have a plug-in to her browser. This would detect emails or websites that she is viewing in the browser, and which have the notphish tag. Then, the plug-in could extract any links in the email or web page, and compare these to those in the PL for that company, where the plug-in gets the PL from the Agg. That usage could be applied to any type of electronic message. Here we extend those methods, especially those of "3878", to be used against viruses. In "3878" we discussed a method of authenticating electronic messages and arbitrary files, where the latter (and of course the former) are expected to be disseminated in any fashion across the Internet. Plus, the method of this Invention can be explicitly applied to data written in any binary format, as well as higher formats like ASCII. By binary format, we mean a format of ones and zeros that can be directly executed by a CPU, or by a simulated CPU, as in the case of Java bytecode, for example.
The data to be analyzed might be an attachment in an email. Or it might be a file that is already on the user's computer, where the file has arrived on the computer by some means. Suppose the data is from a company that implements our method. At or near the start of the data is a code written in some standard format (like ASCII perhaps). Without loss of generality, assume that ASCII is the chosen format. For notational convenience, the code might be, for example, "<notvirus a='author.com'>".
We chose a representation of the code in the form of tags like those in HTML, SGML or XML. This choice is arbitrary, and our method is not confined to this specific choice. But given the widespread use of HTML, we suggest that the choice offers a useful familiarity to many. The tag has an attribute, shown above as "a", that designates the author or owner of the data. Call this the "author attribute". Its value is some quantity that identifies the author. In the example, we imagine that the author has a website, called author.com.
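A minimal sketch of extracting the author attribute, assuming the exact tag form shown in the example above (the search window size and function names are our own assumptions):

```python
import re

# Look for a <notvirus a='...'> tag near the start of the data. The 512-byte
# window is an illustrative assumption; the spec only says "at or near the
# start of the data".
NOTVIRUS_RE = re.compile(rb"<notvirus\s+a='([^']+)'\s*>")

def find_author(data: bytes, window: int = 512):
    """Return the author domain from the tag, or None if no tag is found."""
    m = NOTVIRUS_RE.search(data[:window])
    return m.group(1).decode("ascii") if m else None

print(find_author(b"<notvirus a='author.com'>\x00\x01..."))  # author.com
print(find_author(b"\x7fELF..."))                            # None
```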
Let Alice be a user who has gotten the data. On her machine is a program, Kappa, that analyzes data (files) and classifies the data into one of these categories:
1. Valid
2. Unvalidated [lacks the tag]
3. Invalid

Kappa looks for the tag in the data. If there is no tag, then the data is "unvalidated". Suppose it finds a tag. Then it looks for the author attribute and value. If this is missing, then it might classify the data as "invalid". This is considered to be worse than unvalidated, because it suggests an actual attempt to mislead the program.
Suppose the author attribute and value are found. Assume the value is in the form of a domain name, as shown above. If the value is in an unrecognized form, or the domain does not exist, then the data might be classified as "invalid".
Likewise, if the domain is on a blacklist, then the data might be classified as
"invalid". The blacklist is optional. Entries in it might be put there for various reasons. Some might be spammer domains, for example. Alice might determine what reasons she will use for the blacklist. Here, the blacklist need not be a simple list of domains. Instead, associated with each domain might be a datum that indicates some classification of that domain. So the reasons filter the blacklist. Also, this determination of what reasons to use might be done at a higher level than Alice; by the sysadmin, for instance. Suppose the domain is not on the blacklist. Then Kappa finds a hash ("h") of the data. The choice of hashing function will be some widely used one. Currently an example of this is SHA-I , though others may arise in the future. Optionally, but preferably, the hash is made of all the data. This is crucial. Because a hash is a very sensitive function of its input bits. In principle, and usually in practice, inverting just one bit in the input will pseudo-randomly flip about half the output bits.
Note that a checksum must not be used in place of a hash. Virus writers have discovered that with a commonly used checksum function, the following can be attempted. Take a given file, into which the writer wants to insert a virus. Find the checksum of the file. Then insert the virus and systematically make certain other changes to the file, such that the checksum of the file is the same as the original checksum. While this is not possible for every choice of file and virus, the fact that it works for some choices renders the usage of a checksum very problematic.
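This weakness can be illustrated concretely. The sketch below is our own illustration: it uses a naive additive checksum, with SHA-256 standing in for a cryptographic hash. The attacker compensates for her change so the checksum still matches, while the hash does not:

```python
import hashlib

# A naive additive checksum: trivially easy to preserve under modification.
def checksum(data: bytes) -> int:
    return sum(data) % 256

original = bytearray(b"payload: benign code here.......")
tampered = bytearray(original)
tampered[9:15] = b"virus!"        # overwrite six bytes with a "virus"

# Compensate in a padding byte so the additive checksum still matches.
delta = (checksum(original) - checksum(tampered)) % 256
tampered[-1] = (tampered[-1] + delta) % 256

assert checksum(tampered) == checksum(original)   # checksum is fooled
assert hashlib.sha256(bytes(tampered)).digest() != \
       hashlib.sha256(bytes(original)).digest()   # the hash is not
```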
Kappa now sends (author.com, h) to the Agg. It is asking the Agg: is author.com in your list of companies, and if so, is h in author.com's list of valid hashes? If author.com is not in the Agg's company list, then it replies "no" to Kappa, who then marks the data as "invalid". While if author.com is an Agg customer, then at some earlier time it has presumably sent h to the Agg, along with hashes of any other data that it might be promulgating. If the h sent from Kappa is in this list, then the Agg replies "yes" and Kappa marks the data as "valid". Otherwise the Agg replies "no" and Kappa marks the data as "invalid".
A status of "invalid" is worse than "unvalidated", because it appears that someone is trying to fake a tag. More suspicious than merely lacking a tag.
Of course, when author.com and other corporate customers of the Agg make their hashes, they need to use the same hash function that the client program, Kappa, uses. Alternatively, an elaboration of the above tag is to have an attribute that designates a choice of hash function.
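The classification flow above might be sketched as follows. This is an illustration only: the Agg's reply is modelled as an in-memory lookup (in practice it is a network query), and SHA-256 stands in for whatever agreed hash function Kappa and the Agg's customers use:

```python
import hashlib

def classify(data: bytes, author, agg_hashes, blacklist=frozenset()):
    """Return one of 'valid', 'unvalidated', 'invalid' per the flow above.
    agg_hashes maps each Agg customer domain to its set of valid hashes."""
    if author is None:
        return "unvalidated"          # no tag / no author attribute found
    if author in blacklist:
        return "invalid"
    if author not in agg_hashes:
        return "invalid"              # author is not an Agg customer
    h = hashlib.sha256(data).hexdigest()
    return "valid" if h in agg_hashes[author] else "invalid"

data = b"<notvirus a='author.com'>program bytes"
agg = {"author.com": {hashlib.sha256(data).hexdigest()}}
print(classify(data, "author.com", agg))                 # valid
print(classify(data, None, agg))                         # unvalidated
print(classify(b"tampered" + data, "author.com", agg))   # invalid
```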
Optionally, if Kappa finds that the data is invalid, due to its computed hash not being in author.com's official hashes, then it might alert that company or the Agg, indicating that someone is trying to pass off data as coming from author.com. Optionally, Kappa might offer to upload the invalid data to the company or the Agg, so that they could analyze it. This might be overridden by Alice or by her sysadmin.
The corporate clients like author.com might run a Web Service. This would reply to queries from the Agg, for the latest list of valid hashes. The above is a simple extension of "2245" and "2458". In those Provisionals, for antiphishing, we emphasized a company sending to the Agg its Partner List (PL), because a link to a phisher's website is often the crucial part of a phishing message. But here, for arbitrary data, it is the signature (as represented by a hash) that is more germane. The role of the Agg is vital. In general, it is inadvisable for Kappa to directly go to an arbitrary website described by the author attribute, because a virus writer might set up a domain, possibly appearing similar to a "respectable" domain, and then direct Kappa there, where Kappa will then be told that the data's hash is valid. However, it is possible for Kappa to have a whitelist of domains, so that if author.com is in this list, then instead of Kappa presenting (author.com, h) to the Agg, it presents (h) to the Web Service at author.com.
Our method is far easier than a PKI signing of the data. The hashing is faster than a PKI. Most, if not all, governments permit common hash function usage, whereas PKI implementations might be restricted by some governments. Plus a PKI involves cumbersome issues of key handling and revocation, which are moot in our method.
Our method is deterministic, when it classifies data as valid or invalid. (When the data lacks a tag, then existing anti-virus methods have to be used on it.) Some anti-virus methods use heuristics (rules of thumb) to try to identify a virus. These are probabilistic classifications.
Some companies which regularly release software updates might compute hashes of these updates, and then list these hashes on a web page, beside the name of the corresponding update. While this is certainly compatible with our method, it is far more cumbersome than ours. It expects that a user confronted with a file that purports to be from author.com knows enough to find a hash of the file, using the same hashing method as done by author.com, and then goes to the appropriate web page at author.com to visually compare the resultant hashes. A typical user is simply unable or unwilling to do this.
Our method is a whitelist of hashes of known, "good" data. We do not attempt to discern the functionality or intent of data; ours is a far simpler problem. This is analogous to "0046" for antispam, where we avoided a semantic analysis of a message in order to discern if it is spam or not.
An extension of our method is where a company can associate ancillary data with its published hashes. This data might include short descriptions about the file. Like a title and version number, for example.
In passing, note that the concept of a tag at the start of certain types of data is known. For example, the first four bytes of every Java class file are the hexadecimal magic number 0xCAFEBABE. It is used by various Java tools as a quick indicator of whether a file is a Java class file or not. Or, in another example, the start of a Postscript file should have "%!PS-Adobe-1.0". (The 1.0 version could be different in some Postscript files.) In our method, we have a more expressive usage.
2.1 Web Services
If a Web Service accepts data, it can use our method to validate the data. Especially if the Service were perhaps to later install and run the data. That is, the data is treated as a program. Currently the main approach taken by the Web Service Description Language and the Business Process Execution Language is to permit the use of digital signing (PKI) of various portions of the XML data that is transmitted between two (or more) Web Services.
Note that the validation we describe here for Web Service data differs from the XML validation of the data. We assume that the data passes the XML validation. Typically, if it does not, the receiving Service would reject the data anyway.
Our method is especially appropriate in the context of Web Services. As discussed earlier, it is meant for programmatic interaction, and the client companies of the Agg run a Service that answers queries about hashes of its published data. And, as above, it is more lightweight (faster) than PKI.
2.2 Other Extensions
The program Kappa that we have described above could be implemented in several ways. For example, it might be incorporated into Alice's runtime environment, such that she can only run (or in general access) the data if Kappa validates it. Depending on her user privileges, she might be able to override this constraint or not.
If Kappa has a user interface, it might indicate in some simple way the state of data that it has attempted to validate. For example, it might have a widget that turns green when the data validates, and red when it invalidates. This follows our suggestions in the Antiphishing Provisional for an antiphishing browser plug-in.
An implementation of Kappa could be also as a browser plug-in.
While the emphasis of this method has been to combat viruses, it also has broader scope. An associated problem that a company might have is if it regularly releases data to external users. It does not want unauthorized others to issue other data that is purported to come from itself. Or, it does not want copies of its data altered and then promulgated as unaltered data. Broadly, this falls under the rubric of Digital Rights Management (DRM). With the important caveat that our method does not discuss unauthorized copying.
Suppose a company ("Rho") regularly releases data in a format that needs a special program ("Theta") for users to view or use the data. Theta might be made by the company ("Phi") that owns the format. In general, Phi is different from Rho.
One approach to prevent unauthorized changes is for the format to have some type of intricate encoding or encryption. This is used by Theta such that if it detects invalid data (caused presumably by unauthorized changes), then it will not display the data, or do whatever else it might normally do with valid data.
This is a fundamentally hard approach. Imagine Eve, who wants to make unauthorized changes to a data file. She has the file in her possession. Plus, she has a copy of Theta. The latter is a very good assumption if the format is widely used, because that implies many copies of Theta exist, so Eve can get one of these. Eve can then try to make her changes in a way so that Theta will still display her results. With enough skill on her part, it is very difficult for Rho and Phi, acting together or separately, to prevent this.
Our method is far easier. Instead of preventing a change, we detect it. Rho can be a customer of the Agg, and furnish hashes of its valid data. Phi can write a version of Theta such that whenever Theta encounters a file, it queries the Agg.
As above, it finds a hash of the file, and sends (Rho, hash) to the Agg, asking if this is valid.
Our method is applicable for mass customization. Imagine a company with millions of customers. It can send out data specific to each customer, and have the hashes for these uploaded to the Agg. Using a hash table gives a fast lookup of these. [Cf. "The Art of Computer Programming" by D. Knuth, Addison-Wesley, vol. 3.] Note that this is simply infeasible for the situation where the company merely publishes web pages of hashes of its released data. That method does not scale into the millions of hashes.
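The scaling claim can be illustrated with a sketch: a hash-based set holds one hash per customer, and membership testing stays constant time on average regardless of the number of entries. The payloads and count here are illustrative:

```python
import hashlib

# One hash per customer, stored in a hash-based set. Lookup cost does not
# grow with the number of entries (O(1) on average), which is what makes
# per-customer hashes practical at the Agg. Count is illustrative.
valid_hashes = set()
for customer_id in range(100_000):
    payload = b"custom data for customer %d" % customer_id
    valid_hashes.add(hashlib.sha256(payload).digest())

query = hashlib.sha256(b"custom data for customer 42424").digest()
print(query in valid_hashes)  # True: membership test is O(1) on average
```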
Thus far, we have discussed the case where the Agg's corporate customers are implicitly major companies. However, the tag author notation can designate any website. Or, more generally, it could use another type of identifier. (A phone number, for example.)
There is the possibility that instead of Kappa classifying data into one of 3 states, there could be a 4th state. Imagine a document author that is not a customer of the Agg. Say it has a domain "someone.com". Kappa (or the Agg) could contact a Web Service at that site, presenting the hash of the data. Hence, if someone.com replies that it has that hash, Kappa could classify the data as, say, "conditional". This would be a state less reliable than "valid" but more reliable than "invalid". We leave open the question of whether it is more or less reliable than "unvalidated".
More generally, there could be more than 4 states, for Kappa (or the Agg) might consult various resources that might have data about someone.com. For example, if an antispam website labels someone.com as a spammer, then this could translate into some value between "valid" and "unvalidated". Whereas if someone.com is considered to be a phisher, then another, worse value might be chosen.
An alternative is to realize that we are measuring along one axis a combination of two traits. Firstly, did a given company release the given data or not? Secondly, what is the reputation of this company? For simplicity, as above, it may be convenient to perform this combination on one axis. Or, it might be split into two axes, so that we get a two dimensional rating of the data. Kappa could then use other logic to classify the data, based on this representation.
We emphasize that our method does not preclude the use of existing antivirus methods. Rather, our method is a way to easily detect valid data from major, reputable companies. It could be used as a pre-filter to the current methods, offering a quick way to reduce the load on those methods. Plus, if users come to expect that most of the new data they encounter should be validated by our method, then it reduces the chance that they will use data that is not validated. That in itself adds pressure on the virus writers.

2.3 Registrar

We now discuss other extensions of "3878". In it, we described the general case of three parties: a sender ("Jane") and her ISP; an Agg, which we termed a Registrar, that holds hashes of Jane's outgoing messages, and possibly of some files that she disseminates; and a recipient ("Dinesh") and his ISP. Here, by "ISP", we mean a user's message provider.
An important special case is where the Registrar is Jane's message provider. It has the simplifying consequence that Jane does not need a special plug-in to be integrated with her browser. Instead, she can use her browser to go to her provider's website, and there she types and sends a message. Her provider has logic that can then hash her message, or designated subsets thereof. Effectively, the plug-in gets collapsed into the Registrar. Hence, her provider could offer a Web Service that answered queries from external (or internal) recipients like Dinesh.
While "3878" did not explicitly address this case, it also did not explicitly exclude it.
Note that Dinesh may still need a plug-in at his browser, in order to compute a hash from Jane's message, and to then submit this and her electronic address, to Jane's provider. Or, perhaps, Dinesh gets his messages at an ISP that can perform these tasks. An extension is where the hash is written into the message in some fashion.
For example, if the message is email, the hash might be written into the header. Or, it might be written as a tag into the body.
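A brief sketch of the email header variant follows. The header name "X-Msg-Hash" is invented for illustration; the text leaves the exact format open.

```python
import hashlib
from email.message import EmailMessage

# "X-Msg-Hash" is an invented header name, chosen only for
# illustration; the text leaves the exact format open.
def add_hash_header(msg):
    """Sender side: write the body hash into an extended header."""
    body_hash = hashlib.sha256(msg.get_content().encode()).hexdigest()
    msg["X-Msg-Hash"] = body_hash
    return msg

def header_matches_body(msg):
    """Recipient side: recompute the hash and compare to the header."""
    expected = hashlib.sha256(msg.get_content().encode()).hexdigest()
    return msg["X-Msg-Hash"] == expected
```

Here both sides must hash the same canonical form of the body; using `get_content()` on both ends is one way to ensure that.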
We have used the term "hash" to designate the usage of a common hashing function against the message (or subsets thereof). More generally, this could be any function of the message. So that we can speak of an "id" associated with a message.
This id might even be independent of the message; and perhaps be chosen randomly. In this case, the id would have to be transmitted with the message, in some format that was made public. Dinesh's plug-in (or his ISP's mail client) can use this known format to extract the id. It can then ask her provider (Registrar) if this pair (Jane's address, id) matches an entry in its records. The usage of an id that is independent of the message can lead to spoofing.
If a spammer gets a copy of the message, she can take the id and then send out fake messages. These have that id, and pretend to be from Jane. So this method has less efficacy than using a function of the message. But Jane's provider might still perhaps use it. One reason might simply be that it is harder for a spammer to get hold of a valid id (that is only used for that message) than it is for her to get hold of Jane's address. The former can be considered transient, the latter permanent. And this variant is faster, without the above need to find hashes.
(Though hashes may still need to be made of the ids, in order to hold them in a hash table at the Registrar, for fast lookup when queried.)
An extension is where the id or hash is written in the outgoing message, along with an expiration time for that id or hash. This might be written in some standard format, like Universal Time. It tells Dinesh's mail reading program that after that time, the Registrar will no longer validate the id or hash. So, for efficiency, if Dinesh reads the message after that time, his software will not attempt to, say, make a hash and ask the Registrar about it.
Also, if Dinesh replies to this message, after the id has expired, his software might not put the id into the reply. The default behavior might be to return the id with the reply. So that, for example, Jane's provider sees this id and lets the message get to Jane. While if the id is expired, the provider might discard the message. Hence, to avoid the latter, Dinesh's software might remove the id from his reply. One could imagine that Jane's message might have associated information with the id, that indicates what her provider's policy is with receiving messages with expired ids. Hence, Dinesh's software might run different logic, depending on this information.
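The reply logic just described can be sketched as below. The policy values and the header name are invented for this sketch; the text only says that such policy information could accompany the id.

```python
# Hypothetical provider policies; the text leaves the encoding open.
RETURN_EXPIRED = "provider accepts expired ids"
DISCARD_EXPIRED = "provider discards expired ids"

def reply_headers(msg_id, expires_at, provider_policy, now):
    """How Dinesh's software might decide whether to carry Jane's id
    in a reply, given the id's expiration time and the stated policy
    of Jane's provider."""
    if now < expires_at or provider_policy == RETURN_EXPIRED:
        # Default behavior: return the id so Jane's provider lets
        # the reply through to her.
        return {"X-Msg-Id": msg_id}
    # The id has expired and the provider would discard a reply
    # carrying it, so strip the id instead.
    return {}
```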
An elaboration is to not expire an id when Jane's provider gets a message with that id and the message purports to be from someone on her white list of favored addressees. While it is possible for a spammer to fake one of the latter, in general the spammer will not possess that specific knowledge.

3. Anti-Spear Phishing and Anti-Pharming
There are two cases. A phishing message comes from outside the company or from inside the company.
If a message server at company.com gets a message from an external computer, that purports to be from company.com, then it should not be forwarded to its recipient. It should be discarded, or routed to a sysadmin for scrutiny, in order to detect a possible attack. But perhaps the company does not or will not check for this. It might have a policy of permitting incoming messages to claim to be from company.com. In general, this is undesirable, but suppose this is the situation. Then we effectively have the next case.
Now imagine that the message comes from (or purports to come from) a company computer. In ["2245", "2458"], we explained how the company can publish a Partner List (PL) of other companies' domains, where these domains, and only these, might appear in valid messages from the company. The PL is disseminated to external organizations, like ISPs or an Aggregation Center (Agg). Where an Agg might in turn make it available to a browser plug-in that can use these to validate messages.
Clearly, if the company is promulgating a PL externally, it can certainly do so internally. In effect, it acts as its own Agg, in this important special case. And it, acting as its own ISP for received messages, can apply the PL. Or the users' computers might have a plug-in that applies the PL against messages. By applying the PL in the manner described in "2245", messages claiming to be from company.com and which have unauthorized links to domains outside the PL can be easily detected. In "2245", we discussed how inside the company, it might choose not to do this. But against spear phishing, the company should apply the PL.

An elaboration on the PL is also possible. In "2245", the domains in the PL are what we term "base domains". Essentially, these are the minimum set of rightmost fields in a domain name that can be purchased. For example, ape.b53.com has the base domain b53.com, while g3.h00.co.uk has the base domain h00.co.uk. Most medium sized or large companies might extend their base domain into subdomains. So imagine that company.com has it.company.com, hr.company.com, sales.company.com and some other domains. It can make several internal PLs, each corresponding to a subdomain. For example, it.company.com could have the PL {it.company.com, comp1.com, tech2.com}, and hr.company.com could have the PL {hr.company.com}. The latter means that a valid message from hr.company.com can only have links to that domain (or possibly its subdomains).
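The per-subdomain PL check might be sketched as follows. The names `PARTNER_LISTS`, `extract_link_domains` and `message_is_valid` are hypothetical, and the link extraction is deliberately crude; a real plug-in would parse the message properly.

```python
import re

# Hypothetical internal Partner Lists, mirroring the example above.
PARTNER_LISTS = {
    "it.company.com": {"it.company.com", "comp1.com", "tech2.com"},
    "hr.company.com": {"hr.company.com"},
}

def extract_link_domains(body):
    """Crude link extraction, for illustration only."""
    return set(re.findall(r"https?://([\w.-]+)", body))

def message_is_valid(sender_domain, body):
    """Every link must point at a PL domain or one of its subdomains."""
    pl = PARTNER_LISTS.get(sender_domain, set())
    def allowed(domain):
        return any(domain == d or domain.endswith("." + d) for d in pl)
    return all(allowed(dom) for dom in extract_link_domains(body))
```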
So suppose a virus has infected desk15.company.com, and then it sends out messages pretending to be from hr.company.com. If the message does not have any links, then a reply by a recipient will go to hr and not desk15. But if the message has a link to desk15 then it will be detected via the above PL for hr, as questionable (invalid).
This assumes that the virus has not been able to infect a computer inside hr. Suppose a computer in the latter domain is just as susceptible as a company computer outside hr. Even under these circumstances, by simple proportionality, the company has lessened its chances of being attacked. More realistically, in large companies, certain crucial departments (with attendant subdomains) are more likely to be very carefully secured. These are also the departments that a phisher is more likely to imitate. Hence our method can be of value in offering a layered defense.

An elaboration offers even more safety. Consider someone ("Mark") at hr, sending a valid message to Lucy, where she is in the company, but not in Human Resources. Mark uses (perhaps has to use) a mail sending program that automatically finds the hash of his message body. Optionally, the text from which the hash is found might also include various header fields, like Mark's address in hr. This hash is sent to the company's server computer that will answer the query described below.
The program that Mark uses to write his message can insert a Notphish tag ["2458"] that has an attribute. The recipient's plug-in (or the company's mail server) sees this attribute and then finds the hash of the message. It queries the above server, presenting the hash, and asking if a message from hr has that hash. If so, then the message is considered valid. If not, then it is bogus. It did not come from Mark.
What if Amy pretends to be mark@hr.company.com, and just omits the tag? The company could require that all its computers have the plug-in, and that the plug-in requires the tag to be present in any message received from hr.company.com. Or, more simply, the tag might be omitted, and the plug-in will automatically hash any message received from hr.company.com, and then query the above server with this hash.
An alternative to using the notphish tag is to write an extended header in the message, if those types of messages permit this. (Email does.) Then, the plug-in sees this special header and performs the above.
Note that by representing the messages as hashes, it preserves the privacy of the messages. As well as being a very compact storage format.
Another possible scenario is if Amy sends a message claiming to be from joe@company.com. It goes to alice@company.com. The message text pretends to be a request from Human Resources, and contains links to hr.company.com, for verisimilitude. But it also has a link to a computer that Amy controls. If the link goes outside the company, then our standard use of a PL can detect this and invalidate the message. But suppose that link goes to desk15.company.com? We apply the concept of the Restricted List (RL) from "2640". Imagine that the company defines its RL to include hr.company.com. Then if the plug-in detects that a message has any links to the latter, and links to domains not in the RL (like desk15.company.com), it might classify the message as "MildWarning", which is not as severe as invalid. Plus, the plug-in might visually indicate in some fashion the latter link. In "2640", we used the RL primarily for users who receive mail at addresses outside company.com, where the mail purports to be from company.com. It can be seen that here, we are essentially applying the same RL internally.
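The internal RL classification can be sketched as below. The names `RESTRICTED_LIST` and `classify_links` are illustrative, assuming link domains have already been extracted from the message.

```python
# Hypothetical Restricted List; the text above gives hr.company.com
# as an example entry.
RESTRICTED_LIST = {"hr.company.com"}

def classify_links(link_domains):
    """A message mixing links to RL domains with links to non-RL
    domains gets "MildWarning", which is less severe than invalid."""
    has_rl = any(d in RESTRICTED_LIST for d in link_domains)
    has_other = any(d not in RESTRICTED_LIST for d in link_domains)
    return "MildWarning" if has_rl and has_other else "ok"
```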
We discussed the case where the company has one base domain. But our method can also be applied when the company has several base domains. Some of these might just be aliases to a given base domain. Others might have their own addresses and separate computers. This might arise if, for example, the company has several divisions that each maintains its own presence and identity on the web.
Another merit of our method is that it is a means of detecting a company computer that has been subverted by a virus, if that virus releases spear phishing messages that refer back to the computer. If a virus is present, it could also perform other types of attacks on the company. So having an extra means of detecting it is useful. Our method, considered as a virus detection technique, is entirely separate from, and complements, conventional antivirus methods. (Cf.
"Virus Research and Defense" by P Szor, Addison-Wesley 2005, for a comprehensive discussion of the latter methods.)
3.1 Anti-Browser Hijacking
The browser plug-in that we described here and in the Antiphishing Provisionals can also have its functionality extended to combat a type of browser hijacking. The latter can arise if the user gets a message with an attachment that ends up being run. Or possibly if a virus gets installed on the user's machine by whatever means. The attachment or virus waits till the user completes her login at a bank's website. (It might also have key logging ability that records her keystrokes.) The attachment has an internal record of various banks (or similar institutions), as well as information about how to recognize a successful login. Once the latter happens, the virus makes the browser session invisible and makes another window that the user uses. From the phisher's viewpoint, this circumvents any login mechanism, such as a hardware two-factor device. The virus is then able to empty the account.
The plug-in can implement several methods. It could indicate (warn) the user whenever a session is being made invisible. Or it could prevent such an action. (To be able to do this depends on the given browser.) More specifically, the plug-in might do this only when the user is at a domain in the plug-in's set of domains.
3.2 Anti-Pharming Man in the Middle

Pharming is where a phisher sets up a website that pretends to be another website. The latter is usually a financial website. Imagine that we have a bank, BankO, with a (real) website, bankO.com. Amy sets up a fake website that imitates bankO.com. She then uses various techniques to drive clients of BankO to her website, and once there, to enter in their personal data. So that she can drain their accounts or use that data to set up fake identities. Note that we use the term "pharming" in a slightly different meaning than that used by some others. For example, Wikipedia defines it as "the exploitation of a vulnerability in DNS server software that allows a cracker to acquire the domain name for a site, and to redirect, for instance, that web site's traffic to another web site." (Cf. en.wikipedia.org/wiki/Pharming.) Our meaning includes this, but we also include the more general case where, by whatever technical means, a phisher has set up a website ("pharm") that pretends to be another website.
Here, we consider the case where Amy performs a man in the middle (MITM) attack. We have the following configuration:

Jane (Phi) <--> Amy (Rho, Kappa) <--> bankO.com (Theta)
The terms in the parentheses are the raw network addresses of each party. To Jane, Amy pretends to be BankO, at address Rho. While to BankO, Amy pretends to be Jane, at address Kappa. In principle Rho and Kappa could be the same. But in general, Amy might have two (or more) machines, physically separate. One is Rho and the other is Kappa. Typically, she wants the machine that fakes BankO to be visited by as many people as possible, and she might invest some effort in making an effective fake website at Rho. It would be imprudent for her to directly contact BankO from that machine. Especially for many users. Since BankO might get suspicious if too many users try to login from the same machine. So in general, Amy would have several Kappas that contact BankO, where these might have addresses widely scattered on the network. Here, given one specific customer, Jane, we need only consider the case of one Kappa.
It is well known in cryptography that MITM attacks are very difficult to defeat. Countermeasures might be divided into two groups. The first tries to prevent Amy from changing any data that she gets, before passing it on. But it does not prevent Amy from recording this data and then perhaps attempting later to use it in some fashion advantageous to her. The second type of countermeasure tries to detect the presence of Amy before any information is handed to her that she can misuse. In "3879", we described a method that falls in the former category. In the current Provisional, we now describe methods in the latter category.
The MITM attacks might be mostly directed toward mobile computing. Imagine that Jane has a laptop, PDA, mobile phone or some other device capable of browsing the network. Without loss of generality, consider the case where the network is the Internet. Jane takes her mobile device to a hot spot, like a cybercafe or a hotel, say, and then logs in to the network. Her connection could be wireless or wired. In such unfamiliar surroundings, she can be vulnerable to pharming. Where Amy might have perhaps done DNS poisoning of the hot spot machine that connects Jane to the network. This poisoning might map "bankO.com" away from the real Theta address to Amy's Rho. We assume that at some earlier time, Jane and BankO share a secret key, K.
This may be her usual password, where she might normally log in with a username and password. Or it could be a special password established by them, to be used specifically in this method. This earlier time might be when she is logged into the network at her workplace or home. Where she is in a familiar environment and where her network provider might reasonably be considered to be trusted.
Jane's machine might have a browser plug-in, say, that has, at that earlier time while she was connected to the network in a more secure context, recorded the mapping from bankO.com to Theta. Hence, a simple check by the plug-in of Theta against the Rho will reveal a difference and thus alert Jane to the fake website.
Against this, Amy might resort to a more sophisticated attack. She manages to subvert the hot spot's software such that Jane sees a "pocket Internet". Where Rho = Theta. And where, for verisimilitude, most of Jane's other browsing is directed to those real websites.
Against this, our method does the following. When Jane goes to login at what she thinks is BankO, her plug-in computes the following hash - h(K+Phi). Here, h() designates a common hash function. The plug-in can find Phi and it already knows K. Typically, Phi is assigned to Jane's machine via some method like DHCP. Then, h(K+Phi) is passed to what is actually Amy. Suppose she passes it unchanged to BankO. Along with Jane's username. We imagine that BankO makes a login page, in which Jane or her plug-in has typed her username in the appropriate box. But instead of, or in addition to, a box for the password, there is a box for the plug-in to write h(K+Phi). Due to the insecure context of this interaction, the bank might carefully NOT put a box where Jane can type her password. Because even if this is an https connection, Amy will get Jane's password in plain text, since Amy sits at the other end of that connection with Jane.
Now, given Jane's username, BankO can find K in its database. From the purported Jane (really Amy) at Kappa, it computes h(K+Kappa) and compares this to the received h(K+Phi). In general, Kappa is different from Phi, and so the two hashes will be different. This warns BankO that there appears to be a MITM. Hence, it will not send account data to Kappa.
Note that Kappa cannot be usefully concealed from BankO. It is the sender address in the packets that go from Kappa to BankO. Granted, Amy can change that sender field to anything she wants. If it is changed to another machine which she runs, then this is effectively the same as leaving it unchanged as Kappa. Suppose she changes the field to that of a machine she does not run. Then she does not get any reply sent by BankO, and her MITM attack fails. Above, we used the notation "K+Phi". Here the plus sign means some function of two variables, K and Phi. It does not necessarily mean arithmetic addition or the appending of the bits representing Phi to those representing K. Though it could. What is essential is that BankO and the programmers who coded the plug-in implement the same function, whatever that is, just as they need to choose the same hash function to use. Nor does the notation preclude the use of other parameters. Remember that at some earlier point, and not presumably at that hot spot, Jane established a valid connection to BankO. So at that time, BankO and the plug-in might have settled on other parameters to be used in the above computation.
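A minimal sketch of this check follows, with SHA-256 and string concatenation standing in for the agreed h() and "+" (both of which the text deliberately leaves open). The function names and addresses are illustrative.

```python
import hashlib

def h(k, addr):
    """One concrete choice of the agreed functions: '+' is string
    concatenation and h is SHA-256."""
    return hashlib.sha256((k + addr).encode()).hexdigest()

def bank_detects_mitm(k, observed_sender_addr, received_hash):
    """BankO recomputes the hash using the address it actually sees
    packets arriving from. A mismatch suggests a man in the middle
    relaying from a different address."""
    return h(k, observed_sender_addr) != received_hash
```

Jane computes h(K+Phi) at her own address Phi; if Amy relays it from Kappa, BankO's recomputation over Kappa will not match.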
The method is very simple to implement. And the hashing is faster than a PKI cryptographic approach. Plus, unlike a two factor hardware device, it does not need an accurate hardware clock in order to generate one time passwords.
The need for the latter clock can be a significant portion of the cost of those devices.
Also, the fact that the (username, hash) can be sent by Jane in plain text is useful. Because it renders moot the ability of Amy to be at one end of an https connection with Jane. It also generalizes our method to other networks (non-Internet) that might not have the equivalent of an https protocol to guard against eavesdropping by third parties.
A variant on the above is that Jane or BankO might not want her to reveal even her regular username, on the principle of denying as much information as possible to an attacker. Instead, just as they established a common key at an earlier time, they might also have established a temporary username, for her to use when traveling.
Another variant is that at that earlier time, they might have established several keys or several temporary usernames.
En passant, note that the above method lets BankO detect if Amy is present. If so, BankO ends the login. But what if Amy goes ahead anyway, and pretends to Jane that the login was successful? In practice, this will fail. Because a successful login should invariably require that Jane sees details about her account that make sense. Amy does not have this information, a priori. Since that is the point of Amy attempting a MITM attack against Jane in the first place, to get such information.
However, there is still the possibility that Amy will present some type of warning page, which claims that Jane needs to reenter some of her personal data. This is "traditional" phishing, as opposed to a strict MITM attack. We can add an extra step to protect her. The real BankO would return another key, and Jane's plug-in would know that value and expect to see it in the reply. Irrespective of any textual information in the reply. The wrong key would cause the plug-in to alert Jane. Here, the key might be for one time usage. Or, equivalently, in a similar manner to the above, it might be a hash of the (key + another datum), where the datum is known to both Jane and BankO, or can be computed to be the same value.
When Jane gets a network address for her computer, if this is assigned by a program at the hot spot that has not been subverted by Amy, then in practical terms, the above method should suffice to detect and defeat Amy. Since Amy won't be able to set Kappa = Phi.
For Amy, one thing she might then try to do is write malware that can control the assigning of a network address to Jane. This is more involved, and this extra intricacy in itself acts to protect Jane, because it makes it more unlikely that a phisher could accomplish this step. But suppose somehow Amy is able to do this. So that in the full network, she picks a Kappa from which she will contact BankO, pretending to be a Jane customer. And in the hot spot subnet, Amy hands out a "Kappa" address to be Jane's Phi.
Jane and BankO can now do the following extra steps. They want to establish an encrypted channel between them. They can do this if they have a common key, which can then be used to do the encryption. Let T be a string or bit sequence to be sent from BankO to Jane. BankO can send the doublet (T, h(T+K)), for example. Amy reads T as plain text. But if she changes either item in the doublet, or both items, then Jane can detect this. Because when Jane's plug-in gets the doublet, it takes the first argument, T, and independently computes h(T+K). If this is not the same as the second argument, then the message from BankO was tampered with. And she does not proceed further. So assume that Jane finds the message is untampered. Her plug-in combines T with K in some manner to find the key for encrypting a (new) channel between her and BankO. Where this manner of combining is also performed by BankO. As above, the programmers of the plug-in and BankO ensure that they are using the same method to get a key, and, of course, the same subsequent encryption method that uses the key. In the above, when we said that BankO sends a doublet to Jane, the reverse might happen. Their abilities are symmetric, in this respect.
When we wrote the doublet as (T, h(T+K)), the ordering of the items has no significance. We just chose a convention of making the first item T. Also, when we wrote that the second item in the doublet is h(T+K), as earlier, the plus sign stands for some agreed upon function of T and K that both BankO and Jane will do.
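The doublet check and key derivation can be sketched as below. SHA-256, concatenation for "+", and HMAC as the manner of combining T with K are all illustrative choices; the text only requires that both sides agree on them.

```python
import hashlib
import hmac

def make_doublet(t, k):
    """Sender side: (T, h(T+K)), taking '+' as concatenation and
    h as SHA-256 -- one possible choice of the agreed functions."""
    return (t, hashlib.sha256((t + k).encode()).hexdigest())

def verify_and_derive_key(doublet, k):
    """Recipient side: detect tampering, then combine T with K
    (here via HMAC, an illustrative choice) to get the channel key.
    Returns None if the doublet was changed in transit."""
    t, tag = doublet
    if hashlib.sha256((t + k).encode()).hexdigest() != tag:
        return None  # T or the tag was changed in transit
    return hmac.new(k.encode(), t.encode(), hashlib.sha256).digest()
```

Amy can read T, but without K she can neither forge a matching tag nor compute the derived key.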
We described the reusing of K (which might conventionally be Jane's password). Clearly, Jane and BankO might have earlier predefined a separate password to be used in this context.
We described the use of a symmetric key encryption method, where the distinguishing feature is how that key is obtained. Clearly, instead of symmetric key encryption, PKI might be used. But PKI is typically used to get around the problem of distributing a key to be shared by two parties. Here, the parties already have access to a shared key.
One issue that arises is how do BankO and Jane decide to use this channel encryption method? Obviously, in our first method, if BankO discovered the presence of Amy, then it might decide to indicate to Jane to use the channel encryption. Amy might have the ability to break this communication, but she cannot decipher it.
So now suppose in our first method, Amy was not detected. Should BankO then allow a conventional login by Jane, or should it insist on the channel encryption? There could be logic external to this Provisional that it uses. In addition to which, Jane might insist on using the second method. A policy decision that she could set for her plug-in. Perhaps she is traveling, and wants to be cautious in an unfamiliar hot spot.
A variant of the channel encryption method involves a different choice of K. Instead of reusing the earlier K from the first method, both Jane and BankO use Y(t), where Y is a function of the time, t. Here, it is assumed that Jane and BankO have separate implementations of a stable clock, that were synchronized at some earlier time. And that the drift in the clocks over the duration that they will be used is small enough that both parties will generate the same Y(t). More generally, Y might be a function of both K and t.
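One way to realize such a Y(K, t) is sketched below, as an assumption of ours rather than anything the text prescribes: bucket the time into fixed intervals and hash it with the shared key, so that modest clock drift within a bucket yields the same key on both sides.

```python
import hashlib

def y(k, t, window=60):
    """Illustrative Y(K, t): hash the shared key with the current
    time bucketed into `window`-second intervals. Drift across a
    bucket boundary would still break this; deployed one-time-
    password schemes handle that case by also trying adjacent
    buckets."""
    bucket = int(t) // window
    return hashlib.sha256(f"{k}:{bucket}".encode()).hexdigest()
```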
Above, we have discussed Jane bringing her mobile computer to some hot spot. Other situations are possible. For example, the steps in the previous paragraph might be implemented where "Jane" is an Automated Teller Machine in some fixed location. BankO owns it, or is one of the banks or credit card agencies that the ATM needs to communicate securely with.
Another situation is where Jane is a person, but instead of her carrying a mobile computer around, she might instead carry a fob that can plug in to a computer at a cybercafe or library. The fob might have what we have referred to above as a plug-in to a browser.
More generally, our channel encryption method can be used in a situation where two parties have a shared secret and wish to defend against a MITM attack.
This is an extension of "0046", where we discussed two ISPs that want to see if the other has received a canonically identical message. That method did not involve the transmission of the original message between them. (The message corresponds to the key in the context of this Provisional.) Instead, one ISP would send the hashes of the canonically reduced message to the other, and ask if the latter has seen those hashes in its messages. If so, the first ISP might ask the second to return the hashes found by starting from a given offset into the reduced message. This offset is sent by the first ISP. This method is used by the first ISP to ascertain that the second ISP has indeed seen the message, instead of the latter just saying "yes". The offset corresponds to the T in the doublet (T, h(T+K)) used above. In "0046", the communication between the ISPs could be in plain text, because an eavesdropper gains very little useful information, if any. And to the extent that she finds information, there is little direct financial advantage to be gained from it, if any. So in "0046", a MITM attack has minor significance.
Another reason is that ISPs may be reasonably expected to have the personnel and resources to guard against such malware attacks as DNS poisoning. And ISPs are often closer to the network backbone, which is more heavily defended against attacks than a hot spot on the periphery of the network.
REFERENCES CITED

Antiphishing Working Group, antiphishing.org
en.wikipedia.org/wiki/Phishing
en.wikipedia.org/wiki/Pki - for Public Key Infrastructure
en.wikipedia.org/wiki/MITM - for Man In The Middle attacks
en.wikipedia.org/wiki/BPEL - for Business Process Execution Language
w3.org/TR/wsdl - for Web Services Description Language
"Virus Research and Defense" by P Szor, Addison-Wesley 2005
"The Art of Computer Programming - Vol. 3" by D Knuth, Addison-Wesley 1998
truste.com - for public seals
verisign.com - for public seals

Claims

1. A method of a company telling an Aggregation Center (Agg) of valid electronic addresses at which the company can receive electronic payments, and where the Agg will promulgate these and other information about the company on an electronic network (like the Internet).
2. A method, using claim 1, where the company writes one or more of these payment addresses into electronic messages or web pages, where the messages or web pages also have a "Notphish" tag, with an attribute that identifies the company, like its Internet domain name.
3. A method, using claim 2, where a browser plug-in analyses a message or web page, sees the Notphish tag, finds a payment address, and confirms this with an Agg, as valid or invalid.
4. A method, using claim 3, where the plug-in can instruct the Agg to transfer funds between a user's account and a company's payment address, for the company whose electronic message or web page has been validated by the plug-in.
5. A method of an Agg getting information from a company ("V") that issues electronic seals, where the information might include any of: list of company's customers that can display a seal on their web pages or messages; is a seal clickable?; must a clickable seal link to V?
6. A method, using claim 5, where a plug-in parses a web page or message, and if it finds a seal, then it asks the Agg for seal information, and compares this against observed properties of the seal and its enclosing page or message, to validate or invalidate the seal, page or message.
7. A method, using claim 5, where an electronic message provider (like an ISP) parses incoming or outgoing messages, and if it finds a seal, then it asks the Agg for seal information, and compares this against observed properties of the seal and its enclosing message, to validate or invalidate the seal or message.
8. A method, using claim 7, where if the provider detects an invalid seal in an incoming message, it deletes the message or puts it into the recipient's Bulk or Trash folder.
9. A method of a company that makes a digital file to be disseminated, writing a "Notvirus" tag into the file, where the tag has some attribute that identifies the company, like its Internet domain name.
10. A method, using claim 9, of the company telling an Agg the hashes of files made with the Notvirus tag, along with the company's attribute that is in tags in those files.
11. A method, using claim 9, of a program run by a user, that takes a file, looks for a Notvirus tag, and if the tag exists, extracts the company name, computes a hash of the file, and presents the name and hash to an Agg, asking the latter if the hash corresponds to a valid company file, and conveys the reply to the user in some manner, possibly including a warning of an invalid tag, which suggests a bad file.
12. A method, using claims 10 and 11, where the Agg responds appropriately to that network query.
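Claims 9-12 pair a publisher-side registration step with a user-side hash check. A minimal sketch follows; the tag syntax and the choice of SHA-256 are assumptions (the claims say only "hashes"), and the in-process `REGISTRY` dictionary stands in for the network query to the Agg of claim 12.

```python
import hashlib
import re

# Stand-in for the Agg's registry (claim 10): company domain -> hashes of
# files the company published with a Notvirus tag. A real Agg would be
# queried over the network (claim 12).
REGISTRY = {}

def register(domain, data):
    """Company side (claim 10): tell the Agg the hash of a tagged file."""
    REGISTRY.setdefault(domain, set()).add(hashlib.sha256(data).hexdigest())

# Assumed tag syntax embedded in the file bytes (claim 9).
NOTVIRUS_RE = re.compile(rb'<notvirus\s+domain="([^"]+)"\s*/?>')

def check_file(data):
    """User side (claim 11): find the tag, hash the file, ask the Agg."""
    m = NOTVIRUS_RE.search(data)
    if m is None:
        return "no-tag"
    domain = m.group(1).decode()
    digest = hashlib.sha256(data).hexdigest()
    if digest in REGISTRY.get(domain, set()):
        return "valid"
    return "invalid"    # tag present but hash unknown: suggests a bad file
```

Note the hash covers the entire file, tag included, so any tampering after registration invalidates it.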
13. A method of a browser plug-in warning the user if any instances of the browser are being made invisible, perhaps restricted to cases where the instance is showing a page from a special set of domains (those of banks, for example).
14. A method, using claim 13, where optionally those domains are gotten from an Agg.
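The policy in claims 13-14 separates cleanly from the OS-specific part. In the sketch below, `is_visible` is a hypothetical callback (real visibility detection depends on the windowing system), and `PROTECTED_DOMAINS` stands in for the bank-domain list a plug-in might fetch from an Agg under claim 14.

```python
# Hypothetical set of sensitive domains, as supplied by an Agg (claim 14).
PROTECTED_DOMAINS = {"examplebank.com"}

def check_instances(instances, is_visible):
    """instances: list of (window_id, domain) pairs for open browser
    instances. Returns the window ids the plug-in should warn about:
    hidden instances showing a page from a protected domain (claim 13)."""
    warnings = []
    for win_id, domain in instances:
        if domain in PROTECTED_DOMAINS and not is_visible(win_id):
            warnings.append(win_id)
    return warnings
```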
15. A method of a bank and a customer sharing a common secret (password), K, associated with her username; and where, when she logs in across the network, she has software on her machine that computes h(K+Phi), where Phi is her network address and h() is some hash function used by the bank and the customer's software, and the software sends h(K+Phi) to the bank, along with her username.
16. A method, using claim 15, where the bank uses the username to find K, and then computes h(K+Rho), where Rho is the network address from which the customer communicates; the bank compares this with the h(K+Phi) that it received; if these differ, it suggests a Man In The Middle attack, possibly with a phisher running a pharm pretending to be the bank.
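Claims 15-16 bind the login proof to the address the customer believes she is connecting from. A short sketch of both sides follows; SHA-256 and simple string concatenation for K+Phi are assumptions, since the claims say only "some hash function" shared by bank and customer.

```python
import hashlib

def h(k, addr):
    """h(K+Phi): hash of the shared secret concatenated with a network
    address. SHA-256 is an illustrative choice."""
    return hashlib.sha256((k + addr).encode()).hexdigest()

def client_login(username, k, phi):
    """Customer side (claim 15): hash K with her own address Phi and send
    the result, with her username, to the bank."""
    return {"username": username, "proof": h(k, phi)}

def bank_check(msg, secrets, rho):
    """Bank side (claim 16): look up K by username and recompute the hash
    with Rho, the address the connection actually arrives from. A mismatch
    suggests a Man In The Middle relaying her traffic, possibly a phisher
    running a pharm that pretends to be the bank."""
    k = secrets[msg["username"]]
    return h(k, rho) == msg["proof"]
```

A pharm that relays the customer's login necessarily connects from its own address Rho != Phi, so the hashes differ even though the attacker never learns K.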
PCT/CN2006/001987 2005-08-07 2006-08-07 Systems and methods of enhanced e-commerce,virus detection and antiphishing WO2007016869A2 (en)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
US59580805P 2005-08-07 2005-08-07
US59580905P 2005-08-07 2005-08-07
US59580405P 2005-08-07 2005-08-07
US60/595808 2005-08-07
US60/595809 2005-08-07
US60/595804 2005-08-07
US46271206A 2006-08-06 2006-08-06
US11/462712 2006-08-06

Publications (1)

Publication Number Publication Date
WO2007016869A2 (en) 2007-02-15

Family

ID=37727661

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2006/001987 WO2007016869A2 (en) 2005-08-07 2006-08-07 Systems and methods of enhanced e-commerce,virus detection and antiphishing

Country Status (1)

Country Link
WO (1) WO2007016869A2 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100312837A1 (en) * 2009-06-05 2010-12-09 Chandra Bodapati Methods and systems for determining email addresses
US8495151B2 (en) * 2009-06-05 2013-07-23 Chandra Bodapati Methods and systems for determining email addresses
WO2014150517A1 (en) * 2013-03-15 2014-09-25 Symantec Corporation Techniques for predicting and protecting spearphishing targets
US10069862B2 (en) 2013-03-15 2018-09-04 Symantec Corporation Techniques for predicting and protecting spearphishing targets
WO2015180635A1 (en) * 2014-05-30 2015-12-03 北京奇虎科技有限公司 Online-banking type website visiting method and browser
WO2023180907A1 (en) * 2022-03-24 2023-09-28 Four Drobotics Corporation System and method for detection of cybersecurity threats

Similar Documents

Publication Publication Date Title
Ramzan Phishing attacks and countermeasures
Mao et al. Defeating cross-site request forgery attacks with browser-enforced authenticity protection
US8713677B2 (en) Anti-phishing system and method
AU2004100268B4 (en) Means and method of using cryptographic devices to combat online institution identity theft
JP7520329B2 (en) Apparatus and method for providing e-mail security service using security level-based hierarchical architecture
US8127360B1 (en) Method and apparatus for detecting leakage of sensitive information
US20070174630A1 (en) System and Method of Mobile Anti-Pharming and Improving Two Factor Usage
US20080028444A1 (en) Secure web site authentication using web site characteristics, secure user credentials and private browser
Ivanov et al. Phishing attacks and protection against them
Sengupta et al. e-Commerce security—A life cycle approach
Herzberg et al. Protecting (even) Naive Web Users, or: preventing spoofing and establishing credentials of web sites
Tally et al. Anti-phishing: Best practices for institutions and consumers
WO2007016869A2 (en) Systems and methods of enhanced e-commerce,virus detection and antiphishing
Hussain A study of information security in e-commerce applications
JP7553035B2 (en) Apparatus and method for diagnosing e-mail security based on quantitative analysis of threat factors
Johnson A new approach to Internet banking
KR100960719B1 (en) How to authenticate yourself for enhanced security when joining an Internet service
Alhathally et al. Cyber security Attacks: Exploiting weaknesses
Kumhar et al. Internet Security: Threats and Its Preventive Measures
Prasad et al. Phishing
Njuguna et al. Quick response code security attacks and countermeasures: A systematic literature review
US20100215176A1 (en) Means and method for controlling the distribution of unsolicited electronic communications
Al-Marhabi et al. PPIPAE: Protecting Personal Information from Phishing Attacks in E-commerce
Mehendele et al. Review of Phishing Attacks and Anti Phishing Tools
WO2006026921A2 (en) System and method to detect phishing and verify electronic advertising

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in: Ref country code: DE
122 Ep: pct app. not ent. europ. phase; Ref document number: 06775305; Country of ref document: EP; Kind code of ref document: A1