
US20080005081A1 - Method and apparatus for searching and resource discovery in a distributed enterprise system - Google Patents

Method and apparatus for searching and resource discovery in a distributed enterprise system

Info

Publication number
US20080005081A1
Authority
US
United States
Prior art keywords
repository
classifier
resource
classifiers
added
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/477,021
Inventor
Stephen J. Green
Paul B. Lamere
Jeffrey L. Alexander
Karl R. Haberl
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Microsystems Inc
Original Assignee
Sun Microsystems Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Microsystems Inc
Priority to US11/477,021
Publication of US20080005081A1
Priority to US12/557,403 (US7949660B2)
Legal status: Abandoned

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40: Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually

Definitions

  • This invention relates to resource search and discovery systems that operate in a distributed enterprise system, such as a computer network or an intranet. These systems are conventionally called enterprise search systems (ESSs). Such a system might be available to users working in the enterprise or might be used as a search tool available to users outside of the organization via a mechanism such as a company website.
  • ESSs enterprise search systems
  • ESSs are centralized in that there is typically one application that is responsible for collecting content from the enterprise network or intranet.
  • This application is commonly known as a “spider” or “robot” and locates documents.
  • An indexer then indexes the content of those documents.
  • other applications allow users to query the index, for example, via a web-based query interface.
  • the content of the index in the ESS can rapidly become out-of-date with respect to the actual documents available on the system. Documents that are removed from the system result in “dead links” in search results, which leads to frustration for searchers. At the same time, new content is not included in the index immediately: it must wait until it is located by the spider. This delay can lead to duplication of intellectual effort if, for example, a problem must be re-solved.
  • a desktop search engine is a program that runs on a desktop computer to index personal content such as e-mail messages, visited web pages, and local documents in a variety of formats.
  • Such a local search system could be used, for example, to search an archive of e-mail messages sent to a number of aliases related to a given project.
  • Other local search systems may involve a search engine running on a server shared by a group of users. These local search systems locate more timely and up-to-date content, but move the burden of system administration to the people creating the content.
  • Because local search systems often use differing technologies, such systems lead to a proliferation of search technologies within the enterprise. Attempting to discover the existence of these local systems and then to reconcile the search results produced by a number of different search engines is a difficult problem.
  • a typical strategy used to make sure that the information is available in the enterprise is to include the information on a web server, and then to attempt to make the spider visit that web server.
  • personal data repositories are created by individual users or groups of users. Users identify resources that they want to keep and may suggest a few keywords that describe each resource. Classifiers can then be generated from collections of resources that have been assigned the same keyword. Each generated classifier also specifies a target repository into which a resource that matches that classifier is added. Users can submit classifiers that they created to other repositories. Later, when a user adds a resource to a repository, it is checked against all classifiers that have been created by the user and all classifiers submitted to the repository by other users. If the new resource matches any classifier, the resource is added to the repository specified in the classifier.
  • the resources are tagged with one or more keywords. This allows classifiers to be continually built and tested against a particular set of user keywords.
  • users are notified when resources are added to their repositories so that they can review the keywords that have been assigned to each resource and perhaps assign different or additional keywords.
  • users poll other repositories to determine when those other repositories have added resources to their repositories.
  • a first user associated with a first repository could request that a second repository associated with a second user simply notify the first repository whenever the second repository adds new resource content to a particular category in the second repository.
  • a user can simply forward a classifier that he or she has constructed to another repository for use in that repository. Further, a user can also forward an entire classifier tree, or trees, to another user to enable that other user to classify content in the same manner as the first user
  • FIG. 1 is a block schematic diagram of a conventional distributed enterprise system in which users have assembled private resource repositories.
  • FIG. 2 is a more detailed block schematic diagram of a private repository constructed in accordance with the principles of the invention.
  • FIG. 3 is a block schematic diagram of a conventional process for automatically generating document classifiers from a training set of documents.
  • FIG. 4 is a schematic diagram that represents a typical manner in which features and associated weights can be used to represent a document as a weighted feature vector.
  • FIG. 5 is a schematic diagram that represents features and associated weights that can be used to represent a classifier as a weighted feature vector including a target repository for archiving the document content.
  • FIG. 6 is a schematic diagram that illustrates processing of classifiers to generate meta-classifiers.
  • FIG. 7 is a schematic diagram that illustrates a binary classification tree produced by the process illustrated in FIGS. 6 and 7 .
  • FIG. 8 is a block schematic diagram illustrating the submission of a classifier from one repository to another repository.
  • FIG. 9 is a flowchart showing the steps in an illustrative process for submitting a classifier from one repository to another repository.
  • FIG. 10 is a block schematic diagram of a process used by a document manager for automatically classifying a new document using a classification tree as illustrated in FIG. 7 .
  • FIG. 11 is a flowchart showing the steps in an illustrative process for classifying a new document by comparing a vector representation of the document to meta-classifiers in a classification tree as illustrated in FIG. 7 .
  • FIG. 1 shows a conventional enterprise computer system 100 that includes computers 102 - 112 connected together by an intranet 114 as schematically illustrated by arrows 116 - 126 .
  • resource repositories are personal to each user.
  • resource repositories 128 and 130 are associated with users working on computer 102 .
  • resource repository 132 is associated with a user operating on computer 104 .
  • Resource repositories 134 and 136 are associated with users working on computer 108 and resource repository 138 is associated with a user operating on computer 110 .
  • each repository is assigned a Uniform Resource Identifier (URI) to identify that repository for communication, as described below, with other repositories.
  • URI Uniform Resource Identifier
  • Each of repositories 128 - 138 can retrieve, store and index resources.
  • a more detailed block diagram of a typical repository 200 is shown in FIG. 2 .
  • Other repositories in the system would have the same, or a similar, configuration.
  • Repository 200 retrieves information from a resource by means of a connector that is particular to that resource. The information is then added to the repository by adding either a reference to the information or a copy of the content of that information to the repository. If a copy of the information content is retrieved and archived, then the information will still be available if the original source disappears.
  • Several illustrative connectors are shown in FIG. 2 and others would be known to those skilled in the art. These connectors can retrieve information periodically or one time only.
  • connector 202 can fetch a web page 204 located at a designated Uniform Resource Locator (URL) and retrieve the content of that page. The retrieved content is presented to document manager 238 as indicated schematically by arrow 203 .
  • connector 206 may monitor a particular folder (or directory) 208 residing on a file system. Any changes in that folder, including files newly placed in it, will be retrieved by connector 206 and their contents provided to document manager 238.
  • a connector such as connector 210 may monitor a “syndication feed” 210 , such as a Really Simple Syndication (RSS) feed or an Atom feed, retrieving all articles from the feed and presenting the content to document manager 238 .
  • a connector such as connector 214 , can monitor an Internet Message Access Protocol (IMAP) folder so that electronic mail or bulletin board messages added to the folder are retrieved. This monitoring function can also be used where document manager 238 has its own email address.
  • IMAP Internet Message Access Protocol
  • Document manager 238 stores the references or content that it receives from connectors 202 , 206 , 210 and 214 in an archive 254 as indicated schematically by arrow 256 .
  • retrieved content can be archived to ensure that any search returns links to resources that can be retrieved from their original location, if the original resource still exists, or from the archive 254 , if the original resource does not exist.
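  • The fallback behavior described above (serve the original if it still exists, otherwise serve the archived copy) can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Archive` class, its method names, and the `fetch_original` callback are hypothetical names introduced here, and the example URLs are placeholders.

```python
from typing import Callable, Optional


class Archive:
    """Hypothetical stand-in for archive 254: keeps a copy of retrieved content."""

    def __init__(self) -> None:
        self._copies: dict[str, str] = {}

    def store(self, location: str, content: str) -> None:
        # Keep a copy so the content survives if the original source disappears.
        self._copies[location] = content

    def retrieve(
        self, location: str, fetch_original: Callable[[str], Optional[str]]
    ) -> Optional[str]:
        # Prefer the original location; fall back to the archived copy.
        original = fetch_original(location)
        if original is not None:
            return original
        return self._copies.get(location)


# Usage: a dict plays the role of the live network (assumed for illustration).
live = {"http://intranet.example/a": "current page"}
archive = Archive()
archive.store("http://intranet.example/a", "older copy of a")
archive.store("http://intranet.example/b", "copy of a deleted page")
```

  • Passing the fetch function in as a parameter keeps the sketch independent of any particular connector or network library.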
  • One of the capabilities of a repository is that the incoming resources can be evaluated against classifiers in the repository in order to suggest to the user one or more keywords or categories for each resource that is being placed into the repository. This evaluation generally happens before a user has the opportunity to assign keywords to the resource so that resources that match classifiers can be tagged with appropriate keywords and presented to the user. The user would then typically verify that the system-generated keywords are correct or assign new or additional keywords.
  • the classifiers can be manually built by the user or the classifiers can be automatically and dynamically created and maintained.
  • incoming resource content can be indexed by indexer/search engine 253 and then provided to a classifier generator 250 in order to provide classifiers for the information in the archive 254 .
  • incoming resource information that is not automatically classified into any existing category is placed in an unclassified category.
  • When this category contains a predetermined number of resource information documents, the user is notified and a classifier for that category can be constructed.
  • An exemplary arrangement for automatically generating this classifier is illustrated in FIGS. 3-7 .
  • When the unclassified category contains the predetermined number of documents, the user associated with that repository is notified, for example, by means of a user interface 218.
  • the user can then manually create a new category and rate each document in the unclassified category indicating whether it is relevant or not relevant to the new category.
  • the documents can be provided to an indexer/search engine 253 as schematically indicated by arrow 252 .
  • the indexer then indexes the documents.
  • the documents, ratings and indexes are then applied to a classifier generator 250 from the indexer/search engine 253 as indicated schematically by arrow 251 .
  • FIG. 3 schematically illustrates a conventional process that can be used by classifier generator 250 for automatically building a classifier from a set of “training” documents 300 that have been manually rated as relevant or not relevant for a selected category.
  • text classifiers are used for illustrative purposes.
  • inventive principles can be extended to other types of documents in a straightforward manner using known feature extraction algorithms.
  • IR information retrieval
  • each vector has multiple components, each of which, in turn, comprises a text feature extracted from the document and a numerical weight associated with that feature.
  • the first step 302 in generating a classifier is to process the training documents 300 in order to extract the content features associated with the vector representing each document and to assign a numerical weight to each feature.
  • This is a conventional process in which a stripper is used to remove any formatting information and graphics from the document, producing a stream of plain text.
  • a parser parses the plain text stream into words or word combinations that are features in the IR system. Some mechanism, such as the frequency of occurrence of the feature in the document, is used to assign a weight to each feature.
  • the result is a vector such as that illustrated in FIG. 4 for each training document.
  • the vector 400 comprises features 402 , 404 , 406 and 408 that are selected from the features used in the IR system.
  • Each feature 402 - 408 has an associated weight 410 , 412 , 414 and 416 .
  • Each feature in the system may be represented or, alternatively, features whose weight is zero may be omitted from the vector 400 .
  • a vector is generated for each training document to generate a set of vector representations 304 .
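  • The strip, parse, and weigh steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the tag-stripping regular expression is a crude stand-in for the stripper, single lowercase words are the only features, and relative frequency is used as the weight. The function names are hypothetical.

```python
import re
from collections import Counter


def strip_markup(raw: str) -> str:
    """Crude stand-in for the stripper: drop anything that looks like a tag."""
    return re.sub(r"<[^>]+>", " ", raw)


def to_feature_vector(raw: str) -> dict[str, float]:
    """Parse plain text into word features weighted by relative frequency."""
    words = re.findall(r"[a-z0-9]+", strip_markup(raw).lower())
    counts = Counter(words)
    total = sum(counts.values())
    # Features with zero weight are simply absent from the dict (sparse form).
    return {word: n / total for word, n in counts.items()} if total else {}


vector = to_feature_vector("<p>Search the index; index the <b>search</b> results</p>")
```

  • A sparse dictionary corresponds to the option of omitting zero-weight features from the vector.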
  • the vector representations 304 are provided to a classifier generation algorithm, of which many are well-known.
  • Two such algorithms are the Rocchio Algorithm described in “Learning Routing Queries in a Query Zone”, A. Singhal, M. Mitra and C. Buckley, SIGIR ' 97 : Proceedings of the 20 th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, New York, N.Y. (1997) and the K-Nearest Neighbors Algorithm. See, for example, “Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval”, Y.
  • Classifiers such as classifier 308 are also generally represented as a weighted feature vector, where each feature is a feature selected from the training documents for the classifier. Such a vector is illustrated in FIG. 5 .
  • features 502 , 504 , 506 and 508 are associated with weights 510 , 512 , 514 and 516 , respectively.
  • vector 500 also contains a target repository 518 identifier which may be the URI assigned to that repository.
  • the target repository would be the repository in which the document manager that classifies the document resides.
  • the identified repository could also be another repository to which the classified document is sent.
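  • A classifier of this shape can be sketched with a Rocchio-style prototype: the centroid of the relevant training vectors minus a scaled centroid of the non-relevant ones. The `beta` and `gamma` defaults, the default threshold, the dropping of non-positive weights, and the example URI are assumptions made for illustration; they are not values taken from the patent or from the cited paper.

```python
from dataclasses import dataclass


@dataclass
class Classifier:
    weights: dict[str, float]   # feature -> weight, as in FIG. 5
    threshold: float            # classifier-specific match threshold
    target_repository: str      # URI of the repository that receives matches


def centroid(vectors: list[dict[str, float]]) -> dict[str, float]:
    """Average a list of sparse feature vectors."""
    total: dict[str, float] = {}
    for vec in vectors:
        for feature, weight in vec.items():
            total[feature] = total.get(feature, 0.0) + weight
    return {f: w / len(vectors) for f, w in total.items()} if vectors else {}


def rocchio(relevant, not_relevant, target_repository,
            threshold=0.1, beta=0.75, gamma=0.25) -> Classifier:
    """Build a prototype vector from documents rated relevant / not relevant."""
    pos, neg = centroid(relevant), centroid(not_relevant)
    weights = {}
    for feature in pos.keys() | neg.keys():
        w = beta * pos.get(feature, 0.0) - gamma * neg.get(feature, 0.0)
        if w > 0:  # assumption: keep only features that argue for the category
            weights[feature] = w
    return Classifier(weights, threshold, target_repository)


clf = rocchio(
    relevant=[{"search": 0.5, "index": 0.5}, {"search": 1.0}],
    not_relevant=[{"lunch": 1.0, "index": 0.5}],
    target_repository="http://repo.example/alice",  # hypothetical URI
)
```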
  • the new classifier generated by the classifier generator 250 is stored in a classifier store and tree generator 242 as illustrated schematically by arrow 248 .
  • the dot product process produces a value and if this dot product value exceeds a threshold which is specific to the classifier, then a match is determined. Otherwise, no match occurs.
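  • The match test can be sketched as a sparse dot product compared against the classifier's threshold. The function names are hypothetical, and "exceeds" is read here as a strict comparison.

```python
def dot(document: dict[str, float], classifier_weights: dict[str, float]) -> float:
    """Dot product of two sparse feature vectors."""
    # Iterate over the smaller vector and look features up in the larger one.
    small, large = sorted((document, classifier_weights), key=len)
    return sum(weight * large.get(feature, 0.0) for feature, weight in small.items())


def matches(document: dict[str, float],
            classifier_weights: dict[str, float],
            threshold: float) -> bool:
    """A match occurs only when the dot product exceeds the threshold."""
    return dot(document, classifier_weights) > threshold


doc = {"search": 0.5, "index": 0.5}
weights = {"search": 0.8, "lunch": 0.3}
```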
  • Weighted feature vectors that represent the classifiers can be treated as special “documents” and processed accordingly. For example, clusters of “classifier documents” that contain similar features can be generated. Clustering can be performed by simply counting the number of features that two classifiers have in common or by computing the similarity of the vector representations using conventional clustering algorithms, such as the single-linkage clustering algorithm and the k-means clustering algorithm. A variety of other conventional algorithms that are suitable for use with the invention are discussed in “Data clustering: a Review”, A. K. Jain, M. N. Murty and P. J. Flynn, ACM Computing Surveys , v. 31, n. 3, pages 264-323, ACM Press, New York, N.Y. 1999. This process is illustrated in FIG.
  • Classifier store and tree generator 242 might, for example, be configured to re-generate classifier trees each time a new classifier is added to the classifier store 242 .
  • the process begins by applying classifiers 600 to 602 to a clustering algorithm 604 . The result is that some classifiers will be clustered as indicated by clusters 606 and 608 whereas other classifiers will remain single as illustrated at 610 .
  • the clustering algorithm 604 is designed to produce a cluster from pairs of classifiers so that a binary tree will result when the process is finished.
  • a cluster of classifiers such as clusters 606 and 608 can be considered a set of training documents and a higher-level classifier, or meta-classifier, can be generated from the cluster of classifier documents using the same classifier generation process illustrated in FIG. 3 .
  • classifier cluster 606 can be applied to classifier generator 612 to generate meta-classifier 616 .
  • classifier cluster 608 can be applied to generator 614 to generate meta-classifier 618 .
  • The meta-classifiers, such as meta-classifiers 616 and 618 , together with the unclustered classifiers 610 , can then be clustered via a clustering algorithm 620 , which can be the same algorithm as algorithm 604 or a different algorithm.
  • the clustering algorithm is designed to produce a cluster from two meta-classifiers.
  • the result is a cluster of meta-classifiers of which cluster 622 is illustrated.
  • the meta-classifier clusters can again be applied to a classifier generator 626 , which can be the same as the generators 612 and 614 , or different, to generate a meta-meta-classifier 628 .
  • the meta-meta-classifiers together with any un-clustered classifiers 624 are, in turn applied to another clustering algorithm 630 . This process is repeated until a single root classifier 632 is generated.
  • the result of the process illustrated in FIG. 6 is a binary tree hierarchy (or possibly a small forest) of classifiers built from classifiers (which were built from classifiers . . . ) as shown in FIG. 7 .
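  • The bottom-up construction can be sketched as follows. This is a simplified illustration, not the patented process: clustering is done by greedily pairing the two nodes that share the most features, the meta-classifier for a pair is obtained by averaging the two weight vectors (a stand-in for a real classifier generator), and every node is paired at each round so that a single root always results. `Node`, `shared_features`, `make_meta`, and `build_tree` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass
class Node:
    name: str
    weights: dict[str, float]
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def shared_features(a: Node, b: Node) -> int:
    """Similarity measure: number of features two classifiers have in common."""
    return len(a.weights.keys() & b.weights.keys())


def make_meta(a: Node, b: Node) -> Node:
    """Stand-in for the classifier generator: average the pair's vectors."""
    features = a.weights.keys() | b.weights.keys()
    weights = {f: (a.weights.get(f, 0.0) + b.weights.get(f, 0.0)) / 2 for f in features}
    return Node(f"({a.name}+{b.name})", weights, a, b)


def build_tree(classifiers: list[Node]) -> Node:
    """Pair nodes level by level until a single root remains."""
    if not classifiers:
        raise ValueError("at least one classifier is required")
    level = list(classifiers)
    while len(level) > 1:
        remaining, next_level = list(level), []
        while len(remaining) > 1:
            # Pick the most similar pair among the nodes not yet clustered.
            i, j = max(combinations(range(len(remaining)), 2),
                       key=lambda p: shared_features(remaining[p[0]], remaining[p[1]]))
            next_level.append(make_meta(remaining[i], remaining[j]))
            remaining = [n for k, n in enumerate(remaining) if k not in (i, j)]
        level = next_level + remaining  # an odd node out is carried up unchanged
    return level[0]
```

  • Because each round pairs nodes two at a time, every internal node has exactly two children, which gives the binary tree shape described above.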
  • the binary tree 700 has root classifier node 702 at its highest level.
  • the tree 700 is constructed so that each node has two child nodes that can be selected.
  • If the classifier associated with a node indicates a match for a new document, at most two other classifiers must be checked for a match. For example, if root classifier 702 matches a document, then, at most, meta-classifiers 704 and 706 need be checked for matches.
  • If meta-classifier 704 indicates a match, then meta-classifier 706 need not be checked. Instead, meta-classifiers 708 and 710 are checked. However, if meta-classifier 704 does not match a new document, then meta-classifier 706 is checked for a match. If a match is obtained with meta-classifier 706 , then one or both of meta-classifiers 712 and 714 are checked.
  • At level 720 , all of the nodes, of which nodes 722 - 732 are shown, consist of single classifiers.
  • If no classifier at a given level produces a match, the meta-classifier associated with the node in the previous tree level that did produce a match is used to place the document in a category. For example, if a match is obtained with the meta-classifier in node 704 , but neither the classifier 708 nor the meta-classifier 710 produces a match, then meta-classifier 704 is used to categorize the document.
  • the classifier trees generated by the classifier store and tree generator 242 are provided to the document manager 238 as indicated schematically by arrow 246 for use by the document manager 238 in classifying incoming resource content.
  • a user may submit classifiers that have been generated and stored in the classifier store 242 to other repositories.
  • each repository maintains and publishes a web server, such as web server 236 , to which other repositories can submit classifiers.
  • this classifier could be communicated to a second user's repository.
  • Via the user interface 218 , the user can select a classifier and control the classifier store and tree generator 242 to transfer the selected classifier to the web interface 226 as indicated schematically by arrow 224 .
  • The process starts in step 900 and proceeds to step 902 where a first user associated with repository 1 ( 800 ) locates the web server 820 in repository 802 to which the first user desires to submit a classifier. The process then proceeds to step 904 where a determination is made whether additional classifiers will be submitted to repository 802 .
  • In step 908 , the next classifier to be transferred is selected, for example, classifier 804 .
  • classifier 804 is provided to web interface 808 in repository 800 as indicated by arrow 806 .
  • In step 910 , the classifier is submitted to the web server 820 .
  • web interface 808 transfers the classifier to the location of web server 820 , as previously determined, as indicated schematically by arrow 810 , for example, via the Internet 812 or some other network, to the web interface 816 as indicated by arrow 814 .
  • the process then returns to step 904 to determine whether further classifiers remain to be submitted. If so, processing continues in the manner discussed above. If no further classifiers remain to be submitted, then the process terminates in step 906 .
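  • The submission loop of FIG. 9 can be sketched as follows. The patent does not specify a wire format, so the JSON encoding, the dictionary layout of a classifier, the `send` callback, and the example URLs are all assumptions introduced for illustration. The transport is injected so that the sketch runs without a network.

```python
import json
from typing import Callable, Iterable


def submit_classifiers(
    classifiers: Iterable[dict],
    server_url: str,                      # located in step 902
    send: Callable[[str, bytes], None],   # hypothetical transport, e.g. an HTTP POST
) -> int:
    """Submit each classifier in turn and return how many were sent."""
    sent = 0
    for classifier in classifiers:        # step 904: are classifiers left to submit?
        payload = json.dumps(classifier, sort_keys=True).encode("utf-8")  # step 908
        send(server_url, payload)         # step 910: submit to the remote web server
        sent += 1
    return sent                           # step 906: nothing left, so finish


# Usage with an in-memory "network" standing in for the web interfaces.
outbox: list[tuple[str, bytes]] = []
count = submit_classifiers(
    [
        {"weights": {"search": 0.6}, "threshold": 0.1,
         "target_repository": "http://repo1.example/"},
        {"weights": {"index": 0.4}, "threshold": 0.2,
         "target_repository": "http://repo1.example/"},
    ],
    "http://repo2.example/classifiers",
    lambda url, body: outbox.append((url, body)),
)
```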
  • a classifier arriving at web interface 226 from another repository is provided to the web server 236 as indicated by arrow 222 .
  • the web server 236 can then enter that classifier into the classifier store 242 as indicated by arrow 240 .
  • the web server 236 may also trigger the classifier store and tree generator 242 to regenerate the classifier trees from the contents of the classifier store including the classifier received from the other repository.
  • the classifier store and tree generator 242 can regenerate the classifier trees on a predetermined schedule.
  • In FIG. 8 , classifiers arriving at web interface 816 are provided to web server 820 as indicated by arrow 818 . From web server 820 , the classifiers are entered into the classifier store 824 as indicated by arrow 822 .
  • After the classifier trees in the classifier store 824 have been rebuilt, when new resource content is added to repository 802 , it is classified by document manager 826 using the classifier trees that include the classifiers that were submitted by the first user.
  • The process of using a classifier tree to place a new document, representing new content from a resource, into a category is shown in FIGS. 10 and 11 .
  • This process starts in step 1100 and proceeds to step 1102 where a new document 1000 is parsed to identify and weigh important content features 1002 in a conventional manner.
  • these features are used to represent the new document as a vector 1004 also in a conventional manner.
  • the vector representation 1004 is then applied to the classification tree 1006 that is constructed as described above.
  • the process of using the classification tree 1006 is shown in steps 1106 - 1114 .
  • the process starts at the root classifier of the tree in order to check whether the new document should be classified into the given tree.
  • In step 1106 , a determination is made whether the root classifier generates a match. If no match occurs, the process proceeds to finish in step 1116 because the classification tree is not applicable to the new document.
  • Other classification trees may then be used.
  • In step 1108 , the next lower level in the classification tree is selected. Since the tree is a binary tree, this next level will have two meta-classifiers.
  • In step 1110 , the “left” classifier of the two classifiers is checked. If there is a match, the “right” classifier need not be checked. Instead, the process proceeds back to step 1108 to select the next lower level of the classification tree that branches from the currently selected node. Thus, the search process proceeds in a “depth-first” manner.
  • If, in step 1110 , no match occurs, then, in step 1112 , the “right” classifier of the meta-classifier pair is checked for a match. If a match occurs, the process returns to step 1108 where the next lowest level of the classification tree from the selected node is now selected and the process repeated.
  • If no match occurs at step 1112 , then the lowest level of the tree at which the classifiers are applicable has been reached. At this point, a cluster of classifiers against which the new document should be evaluated is determined by the matching meta-classifier of the previous level. The classifiers in this cluster are then evaluated to determine the category or categories in which the document will be placed, as set forth in step 1114 . The process then finishes in step 1116 . The result is a document category or categories 1008 .
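  • The descent through steps 1106 to 1114 can be sketched as follows. This is one possible reading of the flowchart rather than a definitive implementation: the `Node` layout, the per-node `category` label, and the fallback to the last matching meta-classifier when no leaf in its cluster matches are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    category: str
    weights: dict[str, float]
    threshold: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def matches(node: Node, doc: dict[str, float]) -> bool:
    """Dot product of document and classifier vectors against the threshold."""
    score = sum(w * doc.get(f, 0.0) for f, w in node.weights.items())
    return score > node.threshold


def leaves(node: Node) -> list[Node]:
    """All single classifiers in the cluster below (or at) this node."""
    if node.left is None and node.right is None:
        return [node]
    return [leaf for child in (node.left, node.right) if child for leaf in leaves(child)]


def classify(root: Node, doc: dict[str, float]) -> list[str]:
    if not matches(root, doc):            # step 1106: tree not applicable
        return []
    current = root
    while True:                           # step 1108: move one level down
        chosen = None
        for child in (current.left, current.right):  # steps 1110 and 1112, left first
            if child is not None and matches(child, doc):
                chosen = child
                break                     # a left match means right is not checked
        if chosen is None:
            break                         # lowest applicable level reached
        current = chosen
    # Step 1114: evaluate the classifiers in the cluster under the last match.
    hits = [leaf.category for leaf in leaves(current) if matches(leaf, doc)]
    return hits or [current.category]


search = Node("search", {"search": 1.0}, 0.2)
mail = Node("mail", {"mail": 1.0}, 0.2)
lunch = Node("lunch", {"lunch": 1.0}, 0.2)
work = Node("work", {"search": 0.5, "mail": 0.5}, 0.1, search, mail)
root = Node("root", {"search": 0.3, "mail": 0.3, "lunch": 0.3}, 0.05, work, lunch)
```

  • Only one branch is followed at each level, so the number of classifier checks grows with the depth of the tree rather than with the total number of classifiers.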
  • the document manager processing that resource content sends the content to an archive in the target repository identified by that classifier.
  • the target repository will be the repository in which the document manager is located.
  • the document manager 238 sends the content to archive 254 as indicated by arrow 256 .
  • document manager 238 provides the document content to web interface 226 as indicated by arrow 244 .
  • the content is then sent to the other repository identified by, for example, a URI specified in the classifier.
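  • The routing decision can be sketched as follows: content whose matching classifier names the local repository is archived locally, and anything else is handed to a sender for delivery to the named URI. The `route` function, its return values, and the `send_remote` callback are hypothetical names used only for this illustration.

```python
from typing import Callable


def route(
    content: str,
    target_uri: str,                          # taken from the matching classifier
    local_uri: str,                           # URI of the repository doing the classifying
    archive: list[str],                       # stand-in for the local archive
    send_remote: Callable[[str, str], None],  # stand-in for the web interface
) -> str:
    """Archive locally or forward, depending on the classifier's target repository."""
    if target_uri == local_uri:
        archive.append(content)
        return "archived"
    send_remote(target_uri, content)
    return "forwarded"


local_archive: list[str] = []
sent: list[tuple[str, str]] = []
first = route("doc A", "http://repo.example/alice", "http://repo.example/alice",
              local_archive, lambda uri, body: sent.append((uri, body)))
second = route("doc B", "http://repo.example/bob", "http://repo.example/alice",
               local_archive, lambda uri, body: sent.append((uri, body)))
```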
  • Each repository monitors its identified address to check for the arrival of new content.
  • repository 200 has a connector 232 that monitors its URI as indicated schematically by 234 .
  • connector 232 transfers this content to document manager 238 as indicated by arrow 230 in the same manner that connectors 202 - 214 transfer content to document manager 238 .
  • the content is indexed and stored in archive 254 in the manner discussed above.
  • This mechanism provides for a group information discovery mechanism. When a first user finds content that would interest a second user, the content is automatically placed in the second user's repository.
  • the document manager that identifies the content can alert the second user. For example, as illustrated in FIG. 8 , upon identifying content that matches a classifier submitted by repository 800 , document manager 826 can instruct notifier 828 as indicated by arrow 834 to send a notification to repository 800 via web interface 816 . Web interface 816 sends the notification as indicated schematically by arrow 836 , Internet 812 and arrow 838 to web interface 808 . Alternatively, document manager 826 can post the notification to a public distribution area, such as conventional syndication feed 830 . Repository 800 can then poll the syndication feed 830 , via web interfaces 808 and 816 , to discover when new content has been added to repository 800 .
  • a first user associated with a first repository could request that a second repository associated with a second user simply notify the first repository whenever the second repository adds new resource content to a particular category in the second repository. This could be accomplished by the first user submitting a classifier to the second repository, which classifier identifies content for the category. In this manner the first user is notified about anything that the second user finds interesting in a particular category.
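  • The polling alternative can be sketched with an in-memory feed: the classifying repository posts one entry per matching resource, and the interested repository polls from the position it last read. A real deployment would presumably use an RSS or Atom feed; the `NotificationFeed` class, its entry fields, and the integer cursor are simplifying assumptions.

```python
from typing import Optional


class NotificationFeed:
    """Hypothetical in-memory stand-in for a syndication feed of notifications."""

    def __init__(self) -> None:
        self._entries: list[dict[str, str]] = []

    def post(self, category: str, location: str) -> None:
        # Called by the document manager when content matches a classifier.
        self._entries.append({"category": category, "location": location})

    def poll(self, cursor: int, category: Optional[str] = None):
        """Return entries added since `cursor` and the new cursor position."""
        new = self._entries[cursor:]
        if category is not None:
            new = [entry for entry in new if entry["category"] == category]
        return new, len(self._entries)


feed = NotificationFeed()
feed.post("search", "http://repo2.example/docs/a")
feed.post("lunch", "http://repo2.example/docs/menu")
first_batch, cursor = feed.poll(0, category="search")
feed.post("search", "http://repo2.example/docs/b")
second_batch, cursor = feed.poll(cursor, category="search")
```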
  • the inventive system can be used in situations where a user works on materials that, for security reasons, he or she cannot allow anyone else to see. This user may have spent a great deal of effort on the construction of a classifier for a particular class of documents. He could decide to share his classifier for this class of documents without having to share the documents from which the classifier was constructed. This can be accomplished by making a copy of the classifier and changing the target repository identifier to identify a repository with which the user wishes to share information. The user then instructs the document manager to notify the other repository when relevant information is identified, but not to forward the content to the other repository. Alternatively, a user can simply forward a classifier that he or she has constructed to another repository for use in that repository. Further, a user can also forward an entire classifier tree, or trees, to another user to enable that other user to classify content in the same manner as the first user.
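  • Sharing a classifier without sharing the underlying documents can be sketched as copying the classifier, changing its target repository identifier, and marking it as notify-only. The `forward_content` flag is a hypothetical field introduced here to represent the instruction not to forward content; the patent does not name such a field.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Classifier:
    weights: dict
    threshold: float
    target_repository: str
    forward_content: bool = True  # hypothetical flag: False means notify only


def share_notify_only(classifier: Classifier, recipient_uri: str) -> Classifier:
    """Copy a classifier for another repository without forwarding matched content."""
    return replace(
        classifier,
        weights=dict(classifier.weights),  # copy so the two classifiers are independent
        target_repository=recipient_uri,
        forward_content=False,
    )


private = Classifier({"merger": 0.9}, 0.2, "http://repo.example/alice")
shared = share_notify_only(private, "http://repo.example/bob")
```

  • Only the weighted feature vector leaves the first repository; the training documents from which it was built stay where they are.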
  • a software implementation of the above-described embodiment may comprise a series of computer instructions either fixed on a tangible medium, such as a computer readable medium, for example, a diskette, a CD-ROM, a ROM memory, or a fixed disk, or transmittable to a computer system, via a modem or other interface device over a medium.
  • the medium either can be a tangible medium, including but not limited to optical or analog communications lines, or may be implemented with wireless techniques, including but not limited to microwave, infrared or other transmission techniques. It may also be the Internet.
  • the series of computer instructions embodies all or part of the functionality previously described herein with respect to the invention.
  • Such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Further, such instructions may be stored using any memory technology, present or future, including, but not limited to, semiconductor, magnetic, optical or other memory devices, or transmitted using any communications technology, present or future, including but not limited to optical, infrared, microwave, or other transmission technologies. It is contemplated that such a computer program product may be distributed as a removable media with accompanying printed or electronic documentation, e.g., shrink wrapped software, pre-loaded with a computer system, e.g., on system ROM or fixed disk, or distributed from a server or electronic bulletin board over a network, e.g., the Internet or World Wide Web.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In a distributed enterprise computing system, personal data repositories are created by individual users who generate classifiers that index the information for those repositories. Each generated classifier also specifies a target repository into which a copy of a resource that matches that classifier is placed. Users can submit classifiers that they created to other repositories. Later, when a user adds a resource to his personal repository, it is checked against all classifiers that have been created by the user and submitted to the repository by other users. If the new resource matches any classifier, a copy of the resource is sent to the repository specified in the classifier, where the resource is archived and indexed.

Description

    BACKGROUND
  • This invention relates to resource search and discovery systems that operate in a distributed enterprise system, such as a computer network or an intranet. These systems are conventionally called enterprise search systems (ESSs). Such a system might be available to users working in the enterprise or might be used as a search tool available to users outside of the organization via a mechanism such as a company website.
  • Currently, ESSs are centralized in that there is typically one application that is responsible for collecting content from the enterprise network or intranet. This application is commonly known as a “spider” or “robot” and locates documents. An indexer then indexes the content of those documents. Subsequently, other applications allow users to query the index, for example, via a web-based query interface.
  • There are a number of problems that such a centralized approach creates. First, in the general case, the people responsible for the administration of the ESS are not the people who are creating the content. This means that an administrator cannot easily tell whether a particular document should be included in the index or not. Thus, the administrators of a centralized system tend to spend most of their time making sure that the machines stay running, that the spider does not run out of control or become hung, and that search results for common queries are relevant.
  • Second, the content of the index in the ESS can rapidly become out-of-date with respect to the actual documents available on the system. Documents that are removed from the system result in “dead links” in search results, which leads to frustration for searchers. At the same time, new content is not included in the index immediately: it must wait until it is located by the spider. This delay can lead to duplication of intellectual effort if, for example, a problem must be re-solved.
  • Third, because people cannot find the information that they need, local search systems start to appear. For example, there are currently many desktop search engines available. A desktop search engine is a program that operates on a desktop to index personal content such as e-mail messages, visited web pages, and local documents in a variety of formats. Such a local search system could be used, for example, to search an archive of e-mail messages sent to a number of aliases related to a given project. Other local search systems may involve a search engine running on a server shared by a group of users. These local search systems locate more timely and up-to-date content, but move the burden of system administration to the people creating the content. Furthermore, because local search systems often use differing technologies, such systems lead to a proliferation of search technologies within the enterprise. Attempting to discover the existence of these local systems and then to reconcile the search results produced by a number of different search engines is a difficult problem.
  • Fourth, people who have content that they would like to make available in the enterprise have no easy way to ensure that this content is included in the ESS. A typical strategy used to make sure that the information is available in the enterprise is to include the information on a web server, and then to attempt to make the spider visit that web server.
    SUMMARY
  • In accordance with the principles of the invention, personal data repositories are created by individual users or groups of users. Users identify resources that they want to keep and may suggest a few keywords that describe each resource. Classifiers can then be generated from collections of resources that have been assigned the same keyword. Each generated classifier also specifies a target repository into which a resource that matches that classifier is added. Users can submit classifiers that they created to other repositories. Later, when a user adds a resource to a repository, it is checked against all classifiers that have been created by the user and all classifiers submitted to the repository by other users. If the new resource matches any classifier, the resource is added to the repository specified in the classifier.
  • In one embodiment, as resources are added to a repository, the resources are tagged with one or more keywords. This allows classifiers to be continually built and tested against a particular set of user keywords.
  • In another embodiment, users are notified when resources are added to their repositories so that they can review the keywords that have been assigned to each resource and perhaps assign different or additional keywords. In still another embodiment, users poll other repositories to determine when those other repositories have added resources to their repositories.
  • In yet another embodiment, instead of actually transferring references or content between repositories, a first user associated with a first repository could request that a second repository associated with a second user simply notify the first repository whenever the second repository adds new resource content to a particular category in the second repository.
  • In another embodiment, in situations where a user works on materials that, for security reasons, he or she cannot allow anyone else to see, the user could decide to share his classifier for a class of documents without having to share the documents from which the classifier was constructed.
  • In still another embodiment, a user can simply forward a classifier that he or she has constructed to another repository for use in that repository. Further, a user can also forward an entire classifier tree, or trees, to another user to enable that other user to classify content in the same manner as the first user.
    BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block schematic diagram of a conventional distributed enterprise system in which users have assembled private resource repositories.
  • FIG. 2 is a more detailed block schematic diagram of a private repository constructed in accordance with the principles of the invention.
  • FIG. 3 is a block schematic diagram of a conventional process for automatically generating document classifiers from a training set of documents.
  • FIG. 4 is a schematic diagram that represents a typical manner in which features and associated weights can be used to represent a document as a weighted feature vector.
  • FIG. 5 is a schematic diagram that represents features and associated weights that can be used to represent a classifier as a weighted feature vector including a target repository for archiving the document content.
  • FIG. 6 is a schematic diagram that illustrates processing of classifiers to generate meta-classifiers.
  • FIG. 7 is a schematic diagram that illustrates a binary classification tree produced by the process illustrated in FIG. 6.
  • FIG. 8 is a block schematic diagram illustrating the submission of a classifier from one repository to another repository.
  • FIG. 9 is a flowchart showing the steps in an illustrative process for submitting a classifier from one repository to another repository.
  • FIG. 10 is a block schematic diagram of a process used by a document manager for automatically classifying a new document using a classification tree as illustrated in FIG. 7.
  • FIG. 11 is a flowchart showing the steps in an illustrative process for classifying a new document by comparing a vector representation of the document to meta-classifiers in a classification tree as illustrated in FIG. 7.
    DETAILED DESCRIPTION
  • FIG. 1 shows a conventional enterprise computer system 100 that includes computers 102-112 connected together by an intranet 114 as schematically illustrated by arrows 116-126. In the enterprise system 100, resource repositories are personal to each user. For example, resource repositories 128 and 130 are associated with users working on computer 102. Similarly, resource repository 132 is associated with a user operating on computer 104. Resource repositories 134 and 136 are associated with users working on computer 108 and resource repository 138 is associated with a user operating on computer 110. In accordance with the principles of the invention, each repository is assigned a Uniform Resource Identifier (URI) to identify that repository for communication, as described below, with other repositories.
  • Each of repositories 128-138 can retrieve, store and index resources. A more detailed block diagram of a typical repository 200 is shown in FIG. 2. Other repositories in the system would have the same, or a similar, configuration. Repository 200 retrieves information from a resource by means of a connector that is particular to that resource. The information is then added to the repository by adding either a reference to the information or a copy of the content of that information to the repository. If a copy of the information content is retrieved and archived, then the information will still be available if the original source disappears. Several illustrative connectors are shown in FIG. 2 and others would be known to those skilled in the art. These connectors can retrieve information periodically or one time only. For example, connector 202 can fetch a web page 204 located at a designated Uniform Resource Locator (URL) and retrieve the content of that page. The retrieved content is presented to document manager 238 as indicated schematically by arrow 203. Similarly, connector 206 may monitor a particular folder 208 residing on a file system. Any changes in that folder will be retrieved by connector 206 and their contents provided to document manager 238. Alternatively, connector 206 may monitor a particular directory 208 in a file system, retrieving the content of any files placed in the directory. A connector, such as connector 210, may monitor a “syndication feed” 210, such as a Really Simple Syndication (RSS) feed or an Atom feed, retrieving all articles from the feed and presenting the content to document manager 238. In addition, a connector, such as connector 214, can monitor an Internet Message Access Protocol (IMAP) folder so that electronic mail or bulletin board messages added to the folder are retrieved. This monitoring function can also be used where document manager 238 has its own email address.
  • Document manager 238 stores the references or content that it receives from connectors 202, 206, 210 and 214 in an archive 254 as indicated schematically by arrow 256. As mentioned previously, retrieved content can be archived to ensure that any search returns links to resources that can be retrieved from their original location, if the original resource still exists, or from the archive 254, if the original resource does not exist.
  • One of the capabilities of a repository, such as repository 200, is that the incoming resources can be evaluated against classifiers in the repository in order to suggest to the user one or more keywords or categories for each resource that is being placed into the repository. This evaluation generally happens before a user has the opportunity to assign keywords to the resource so that resources that match classifiers can be tagged with appropriate keywords and presented to the user. The user would then typically verify that the system-generated keywords are correct or assign new or additional keywords. The classifiers can be manually built by the user or the classifiers can be automatically and dynamically created and maintained. In particular, in one embodiment discussed in more detail below, incoming resource content can be indexed by indexer/search engine 253 and then provided to a classifier generator 250 in order to provide classifiers for the information in the archive 254.
  • In accordance with this embodiment, incoming resource information that is not automatically classified into any existing category is placed in an unclassified category. Once this category contains a predetermined number of resource information documents, the user is notified and a classifier for that category can be constructed. An exemplary arrangement for automatically generating this classifier is illustrated in FIGS. 3-7. In particular, when the unclassified category contains the predetermined number of documents, the user associated with that repository is notified, for example, by means of a user interface 218.
  • The user can then manually create a new category and rate each document in the unclassified category, indicating whether it is relevant or not relevant to the new category. Alternatively, the documents can be provided to an indexer/search engine 253 as schematically indicated by arrow 252. The indexer then indexes the documents. The documents, ratings and indexes are then applied to a classifier generator 250 from the indexer/search engine 253 as indicated schematically by arrow 251.
  • FIG. 3 schematically illustrates a conventional process that can be used by classifier generator 250 for automatically building a classifier from a set of “training” documents 300 that have been manually rated as relevant or not relevant for a selected category. In the discussion that follows, text classifiers are used for illustrative purposes. However, the inventive principles can be extended to other types of documents in a straightforward manner using known feature extraction algorithms. Almost all current information retrieval (IR) systems use a “vector space representation” approach for documents. With this approach, each document in the system is represented by a vector in N-space, where N is the number of unique features in the IR system.
  • Generally, features in text documents correspond to text terms, but several text terms may be clustered into a text feature using known techniques. Typically, each vector has multiple components, each of which, in turn, comprises a text feature extracted from the document and a numerical weight associated with that feature. Thus, the first step 302 in generating a classifier is to process the training documents 300 in order to extract the content features associated with the vector representing each document and to assign a numerical weight to each feature. This is a conventional process in which a stripper is used to remove any formatting information and graphics from the document, producing a stream of plain text. Next, a parser parses the plain text stream into words or word combinations that are features in the IR system. Some mechanism, such as the frequency of occurrence of the feature in the document, is used to assign a weight to each feature.
  • The result is a vector such as that illustrated in FIG. 4 for each training document. The vector 400 comprises features 402, 404, 406 and 408 that are selected from the features used in the IR system. Each feature 402-408 has an associated weight 410, 412, 414 and 416. Each feature in the system may be represented or, alternatively, features whose weight is zero may be omitted from the vector 400. A vector is generated for each training document to generate a set of vector representations 304.
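  • As a minimal sketch of the feature extraction of step 302, assume that a feature is a lower-cased word and that its weight is its raw number of occurrences (the description names frequency of occurrence only as one possible weighting mechanism). The function name `document_vector` and the tokenizing pattern are illustrative assumptions, not part of the described system. Features with zero weight are simply omitted from the resulting vector.

```python
import re
from collections import Counter

def document_vector(plain_text):
    """Return a sparse weighted feature vector as {feature: weight}."""
    # Assumption: a feature is a run of letters or digits, lower-cased.
    features = re.findall(r"[a-z0-9]+", plain_text.lower())
    # Assumption: the weight is the number of occurrences of the feature.
    return dict(Counter(features))

vector = document_vector("Search the index. The index is searched.")
```

  • A dictionary keyed by feature keeps the vector sparse, which matters when N, the number of unique features in the IR system, is large.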
  • The vector representations 304 are provided to a classifier generation algorithm, of which many are well-known. Two such algorithms are the Rocchio Algorithm described in “Learning Routing Queries in a Query Zone”, A. Singhal, M. Mitra and C. Buckley, SIGIR '97: Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, New York, N.Y. (1997) and the K-Nearest Neighbors Algorithm. See, for example, “Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval”, Y. Yang, Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval, Springer Verlag, Heidelberg, Germany (1994). The result is a classifier 308. Classifiers, such as classifier 308, are also generally represented as a weighted feature vector, where each feature is a feature selected from the training documents for the classifier. Such a vector is illustrated in FIG. 5. In vector 500, features 502, 504, 506 and 508 are associated with weights 510, 512, 514 and 516, respectively. In accordance with the principles of the invention, vector 500 also contains a target repository 518 identifier which may be the URI assigned to that repository. As will be described below, after a document manager classifies a document, it will send that document to the repository specified in the target repository identifier. Normally, the target repository would be the repository in which the document manager that classifies the document resides. However, in accordance with the principles of the invention, the identified repository could also be another repository to which the classified document is sent. The new classifier generated by the classifier generator 250 is stored in a classifier store and tree generator 242 as illustrated schematically by arrow 248.
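  • The following sketch shows one way a classifier such as vector 500 might be represented and generated. It uses a simplified Rocchio-style prototype (the mean of the relevant vectors minus a scaled mean of the non-relevant vectors, with negative weights dropped), which is not necessarily the variant described in the cited reference. The names `Classifier`, `rocchio_classifier`, `beta`, `gamma`, the threshold value and the example URI are all assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Classifier:
    weights: dict           # feature -> weight (features 502-508, weights 510-516)
    threshold: float        # classifier-specific match threshold
    target_repository: str  # URI of the repository that receives matches (518)

def centroid(vectors):
    """Mean of a list of sparse vectors; empty input gives an empty vector."""
    total = {}
    for vec in vectors:
        for feature, weight in vec.items():
            total[feature] = total.get(feature, 0.0) + weight
    return {f: w / len(vectors) for f, w in total.items()} if vectors else {}

def rocchio_classifier(relevant, not_relevant, target_repository,
                       beta=1.0, gamma=0.5, threshold=1.0):
    pos, neg = centroid(relevant), centroid(not_relevant)
    weights = {}
    for feature in set(pos) | set(neg):
        w = beta * pos.get(feature, 0.0) - gamma * neg.get(feature, 0.0)
        if w > 0:  # keep only features that argue for the category
            weights[feature] = w
    return Classifier(weights, threshold, target_repository)

clf = rocchio_classifier(
    relevant=[{"index": 2.0, "search": 1.0}, {"index": 4.0}],
    not_relevant=[{"search": 2.0, "recipe": 3.0}],
    target_repository="http://example.invalid/repositories/alice",  # hypothetical URI
)
```

  • In this example only the feature "index" survives, because "search" and "recipe" are weighted more heavily in the non-relevant training document than in the relevant ones.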
  • With the document and classifier represented as weighted feature vectors, a determination can be made whether a document matches a classifier by comparing the vectors, for example by forming a “dot product” of the vector representing the document with the vector representing the classifier. If the resulting dot product value exceeds a threshold that is specific to the classifier, then a match is determined. Otherwise, no match occurs.
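  • The matching test can be sketched as follows, assuming sparse vectors stored as dictionaries. The function names and the example weights are illustrative only.

```python
def dot_product(doc_vector, classifier_weights):
    """Sum of products over the features the two sparse vectors share."""
    return sum(weight * classifier_weights.get(feature, 0.0)
               for feature, weight in doc_vector.items())

def matches(doc_vector, classifier_weights, threshold):
    """A match occurs only when the dot product exceeds the threshold."""
    return dot_product(doc_vector, classifier_weights) > threshold

doc = {"index": 2.0, "search": 1.0}
classifier = {"index": 0.5, "spider": 1.0}
score = dot_product(doc, classifier)  # only "index" is shared: 2.0 * 0.5
```

  • Features that appear in only one of the two vectors contribute nothing to the score, so omitting zero-weight features from the vectors does not change the result.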
  • Weighted feature vectors that represent the classifiers can be treated as special “documents” and processed accordingly. For example, clusters of “classifier documents” that contain similar features can be generated. Clustering can be performed by simply counting the number of features that two classifiers have in common or by computing the similarity of the vector representations using conventional clustering algorithms, such as the single-linkage clustering algorithm and the k-means clustering algorithm. A variety of other conventional algorithms that are suitable for use with the invention are discussed in “Data clustering: a Review”, A. K. Jain, M. N. Murty and P. J. Flynn, ACM Computing Surveys, v. 31, n. 3, pages 264-323, ACM Press, New York, N.Y. 1999. This process is illustrated in FIG. 6 and would be performed by the classifier store and tree generator 242 illustrated in FIG. 2 on classifiers previously stored in the classifier store, including the new classifier added by the user. Classifier store and tree generator 242 might, for example, be configured to re-generate classifier trees each time a new classifier is added to the classifier store 242. The process begins by applying classifiers 600 to 602 to a clustering algorithm 604. The result is that some classifiers will be clustered as indicated by clusters 606 and 608 whereas other classifiers will remain single as illustrated at 610. The clustering algorithm 604 is designed to produce a cluster from pairs of classifiers so that a binary tree will result when the process is finished.
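  • The simplest clustering criterion mentioned above, counting the features that two classifiers have in common, can be sketched as a greedy pairing. This is an assumed stand-in for clustering algorithm 604, not the algorithm itself; the function names and the example classifiers are hypothetical. Classifiers that share no feature with any other remaining classifier stay single, as at 610.

```python
from itertools import combinations

def shared_features(a, b):
    """Number of features two classifier vectors have in common."""
    return len(set(a) & set(b))

def pair_clusters(classifiers):
    """Greedily pair classifiers; returns (pairs, names left single)."""
    remaining = dict(classifiers)  # name -> {feature: weight}
    pairs = []
    while len(remaining) >= 2:
        best = max(combinations(sorted(remaining), 2),
                   key=lambda p: shared_features(remaining[p[0]], remaining[p[1]]))
        if shared_features(remaining[best[0]], remaining[best[1]]) == 0:
            break  # nothing left in common: the rest remain single
        pairs.append(best)
        for name in best:
            del remaining[name]
    return pairs, sorted(remaining)

pairs, singles = pair_clusters({
    "java": {"jvm": 1.0, "class": 1.0},
    "jvm-tuning": {"jvm": 1.0, "heap": 1.0},
    "cooking": {"recipe": 1.0},
})
```

  • Restricting each cluster to exactly two members is what makes the resulting hierarchy a binary tree.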
  • A cluster of classifiers, such as clusters 606 and 608 can be considered a set of training documents and a higher-level classifier, or meta-classifier, can be generated from the cluster of classifier documents using the same classifier generation process illustrated in FIG. 3. Thus, classifier cluster 606 can be applied to classifier generator 612 to generate meta-classifier 616. Similarly, classifier cluster 608 can be applied to generator 614 to generate meta-classifier 618.
  • If more than a single classifier remains, meta-classifiers, such as meta-classifiers 616 and 618, together with unclustered classifiers 610 can then be clustered, via a clustering algorithm 620, which can be the same algorithm as algorithm 604 or a different algorithm. Again, the clustering algorithm is designed to produce a cluster from two meta-classifiers. The result is a cluster of meta-classifiers of which cluster 622 is illustrated. The meta-classifier clusters can again be applied to a classifier generator 626, which can be the same as the generators 612 and 614, or different, to generate a meta-meta-classifier 628. The meta-meta-classifiers together with any un-clustered classifiers 624 are, in turn, applied to another clustering algorithm 630. This process is repeated until a single root classifier 632 is generated.
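  • The repeated clustering can be sketched as a loop that merges the two most similar nodes until one root remains. Two simplifications are assumed here: a meta-classifier is formed by averaging the weights of its two children rather than by running a classifier generator, and one pair is merged per iteration rather than clustering an entire level at once. The class and function names are illustrative, and at least one leaf classifier is assumed.

```python
from itertools import combinations

class Node:
    def __init__(self, weights, left=None, right=None, name=None):
        self.weights, self.left, self.right, self.name = weights, left, right, name

def similarity(a, b):
    """Dot product of two nodes' sparse weight vectors."""
    return sum(w * b.weights.get(f, 0.0) for f, w in a.weights.items())

def merge(a, b):
    """Assumed stand-in for the classifier generator: average the children."""
    features = set(a.weights) | set(b.weights)
    weights = {f: (a.weights.get(f, 0.0) + b.weights.get(f, 0.0)) / 2
               for f in features}
    return Node(weights, left=a, right=b)

def build_tree(leaves):
    nodes = list(leaves)
    while len(nodes) > 1:
        i, j = max(combinations(range(len(nodes)), 2),
                   key=lambda ij: similarity(nodes[ij[0]], nodes[ij[1]]))
        merged = merge(nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]  # the single root classifier

def leaf_names(node):
    if node.left is None:
        return [node.name]
    return leaf_names(node.left) + leaf_names(node.right)

root = build_tree([
    Node({"jvm": 1.0, "class": 1.0}, name="java"),
    Node({"jvm": 1.0, "heap": 1.0}, name="jvm-tuning"),
    Node({"recipe": 1.0}, name="cooking"),
])
```

  • In this example the two JVM-related classifiers are merged first, and the unrelated classifier joins only at the root, mirroring the way unclustered classifiers 610 are carried up to a later level.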
  • The result of the process illustrated in FIG. 6 is a binary tree hierarchy (or possibly a small forest) of classifiers built from classifiers (which were built from classifiers . . . ) as shown in FIG. 7. The binary tree 700 has root classifier node 702 as its highest level. The tree 700 is constructed so that each node has two child nodes that can be selected. Thus, once the classifier associated with a node indicates a match for a new document, at most two other classifiers must be checked for a match. For example, if root classifier 702 matches a document, then, at most, meta-classifiers 704 and 706 need be checked for matches. Since the meta-classifiers are produced by a clustering algorithm, if meta-classifier 704 indicates a match, then meta-classifier 706 need not be checked. Instead, meta-classifiers 708 and 710 are checked. However, if meta-classifier 704 does not match a new document, then meta-classifier 706 is checked for a match. If a match is obtained with meta-classifier 706, then one or both of meta-classifiers 712 and 714 are checked.
  • Assuming a match is obtained at one of the two nodes at a tree level, this process proceeds through each level 716 and 718 of the tree until the lowest or leaf level 720 is reached. In level 720, all of the nodes, of which nodes 722-732 are shown, consist of single classifiers. Alternatively, if neither of the meta-classifiers in the nodes at a given level produces a match, then the meta-classifier associated with the node in the previous tree level that did produce a match is used to place the document in a category. For example, if a match is obtained with the meta-classifier in node 704, but neither the meta-classifier 708 nor the meta-classifier 710 produces a match, then meta-classifier 704 is used to categorize the document.
  • The classifier trees generated by the classifier store and tree generator 242 are provided to the document manager 238 as indicated schematically by arrow 246 for use by the document manager 238 in classifying incoming resource content.
  • In accordance with the principles of the invention, a user may submit classifiers that have been generated and stored in the classifier store 242 to other repositories. In particular, each repository maintains and publishes a web server, such as web server 236, to which other repositories can submit classifiers. If a first user has built a classifier for their own personal resources, this classifier could be communicated to a second user's repository. For example, via the user interface 218, the user can select a classifier and control the classifier store and tree generator 242 to transfer the selected classifier to the web interface 226 as indicated schematically by arrow 224.
  • This process is illustrated in more detail in FIG. 8 and the steps in the process are illustrated in the flowchart shown in FIG. 9. In particular, the process starts in step 900 and proceeds to step 902 where a first user associated with repository 1 (800) locates the web server 820 in repository 802 to which the first user desires to submit a classifier. The process then proceeds to step 904 where a determination is made whether additional classifiers will be submitted to repository 802.
  • If further classifiers remain to be submitted, as determined in step 904, then, in step 908, the next classifier to be transferred is selected, for example, classifier 804. As previously mentioned, classifier 804 is provided to web interface 808 in repository 800 as indicated by arrow 806. Then, in step 910, the classifier is submitted to the web server 820. In particular, web interface 808 transfers the classifier to the location of web server 820, as previously determined, as indicated schematically by arrow 810, for example, via the Internet 812 or some other network, to the web interface 816 as indicated by arrow 814. The process then returns to step 904 to determine whether further classifiers remain to be submitted. If so, processing continues in the manner discussed above. If no further classifiers remain to be submitted, then the process terminates in step 906.
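  • The submission loop of FIG. 9 can be sketched as follows. The description does not specify a wire format or transport, so two assumptions are made: each classifier is serialized as JSON, and the transfer itself is abstracted as a caller-supplied `send` function (here a list's `append`, standing in for the receiving web server). All names and the example URI are hypothetical.

```python
import json

def submit_classifiers(classifiers, send):
    """Submit each classifier in turn; returns the number submitted."""
    pending = list(classifiers)
    submitted = 0
    while pending:                                     # step 904: any left?
        classifier = pending.pop(0)                    # step 908: select next
        send(json.dumps(classifier, sort_keys=True))   # step 910: submit
        submitted += 1
    return submitted                                   # step 906: done

received = []  # stands in for the classifier store of the other repository
count = submit_classifiers(
    [
        {"weights": {"jvm": 1.0}, "threshold": 0.5,
         "target_repository": "http://example.invalid/repositories/alice"},
        {"weights": {"heap": 1.0}, "threshold": 0.5,
         "target_repository": "http://example.invalid/repositories/alice"},
    ],
    received.append,
)
```

  • Because the target repository identifier travels with the classifier, the receiving repository needs no separate record of who submitted it in order to route matching content.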
  • Referring to FIG. 2, a classifier arriving at web interface 226 from another repository is provided to the web server 236 as indicated by arrow 222. The web server 236 can then enter that classifier into the classifier store 242 as indicated by arrow 240. The web server 236 may also trigger the classifier store and tree generator 242 to regenerate the classifier trees from the contents of the classifier store including the classifier received from the other repository. Alternatively, the classifier store and tree generator 242 can regenerate the classifier trees on a predetermined schedule. In FIG. 8, classifiers arriving at web interface 816 are provided to web server 820 as indicated by arrow 818. From web server 820, the classifiers are entered into the classifier store 824 as indicated by arrow 822.
  • After the classifier trees in the classifier store 824 have been rebuilt, when new resource content is added to repository 802, it is classified by document manager 826 using the classifier trees that include the classifiers that were submitted by the first user.
  • The process of using a classifier tree to place a new document, representing new content from a resource, into a category is shown in FIGS. 10 and 11. This process starts in step 1100 and proceeds to step 1102 where a new document 1000 is parsed to identify and weigh important content features 1002 in a conventional manner. In step 1104, these features are used to represent the new document as a vector 1004 also in a conventional manner. The vector representation 1004 is then applied to the classification tree 1006 that is constructed as described above. The process of using the classification tree 1006 is shown in steps 1106-1114. The process starts at the root classifier of the tree in order to check whether the new document should be classified into the given tree. In particular, in step 1106, a determination is made whether the root classifier generates a match. If no match occurs, the process proceeds to finish in step 1116 because the classification tree is not applicable to the new document. Other classification trees may then be used.
  • Alternatively, if in step 1106, the root classifier generates a match, then, in step 1108, the next lower level in the classification tree is selected. Since the tree is a binary tree, this next level will have two meta-classifiers. In step 1110, the “left” classifier of the two classifiers is checked. If there is a match, the “right” classifier need not be checked. Instead, the process proceeds back to step 1108 to select the next lower level of the classification tree that branches from the currently selected node. Thus, the search process proceeds in a “depth-first” manner.
  • If, in step 1110, no match occurs, then, in step 1112, the “right” classifier of the meta-classifier pair is checked for a match. If a match occurs, the process returns to step 1108 where the next lower level of the classification tree from the selected node is now selected and the process repeated.
  • If no match occurs at step 1112, then the lowest level of the tree at which the classifiers are applicable has been reached. At this point, a cluster of classifiers against which the new document should be evaluated is determined by the matching meta-classifier of the previous level. The classifiers in this cluster are then evaluated to determine the category or categories in which the document will be placed, as set forth in step 1114. The process then finishes in step 1116. The result is a document category or categories 1008.
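  • Steps 1106-1114 can be sketched as a left-first descent through the tree. The sketch assumes that every non-leaf node has exactly two children, that matching is a thresholded dot product, and that leaf nodes carry a category name; the class, function names, weights and thresholds are illustrative assumptions.

```python
class Node:
    def __init__(self, weights, threshold, category=None, left=None, right=None):
        self.weights, self.threshold = weights, threshold
        self.category, self.left, self.right = category, left, right

def is_match(doc, node):
    score = sum(w * node.weights.get(f, 0.0) for f, w in doc.items())
    return score > node.threshold

def leaves(node):
    if node.left is None and node.right is None:
        return [node]
    return leaves(node.left) + leaves(node.right)

def classify(doc, root):
    if not is_match(doc, root):           # step 1106: tree not applicable
        return []
    node = root
    while node.left is not None:          # step 1108: next lower level
        if is_match(doc, node.left):      # step 1110: check "left" first
            node = node.left
        elif is_match(doc, node.right):   # step 1112: then "right"
            node = node.right
        else:
            break                         # neither matched: stop at this node
    # Step 1114: evaluate the classifiers in the cluster under the last match.
    return [leaf.category for leaf in leaves(node) if is_match(doc, leaf)]

java = Node({"jvm": 1.0, "class": 1.0}, 0.5, category="java")
tuning = Node({"jvm": 1.0, "heap": 1.0}, 0.5, category="jvm-tuning")
cooking = Node({"recipe": 1.0}, 0.5, category="cooking")
meta = Node({"jvm": 1.0, "class": 0.5, "heap": 0.5}, 0.5, left=java, right=tuning)
root = Node({"jvm": 0.5, "heap": 0.25, "class": 0.25, "recipe": 0.5}, 0.2,
            left=meta, right=cooking)

categories = classify({"heap": 2.0}, root)
```

  • Because a match on the left child ends the comparison at that level, at most two classifiers are evaluated per level, so the cost of the descent grows with the depth of the tree rather than with the total number of classifiers.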
  • If new resource content matches a classifier, the document manager processing that resource content sends the content to an archive in the target repository identified by that classifier. In most cases, the target repository will be the repository in which the document manager is located. Thus, for example, in FIG. 2, if document manager 238 determines that the target repository is repository 200, then the document manager 238 sends the content to archive 254 as indicated by arrow 256.
  • However, in accordance with the principles of the invention, if the target repository is a different repository, then document manager 238 provides the document content to web interface 226 as indicated by arrow 244. The content is then sent to the other repository identified by, for example, a URI specified in the classifier. Each repository monitors its identified address to check for the arrival of new content. For example, repository 200 has a connector 232 that monitors its URI as indicated schematically by 234. When new content arrives at the URI, connector 232 transfers this content to document manager 238 as indicated by arrow 230 in the same manner that connectors 202-214 transfer content to document manager 238.
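  • The routing decision made by the document manager can be sketched as a comparison between the classifier's target repository identifier and the local repository's own URI. The lists `archive` and `outbox` stand in for archive 254 and web interface 226 respectively; they, the function name and the example URIs are assumptions for illustration.

```python
def route(content, classifier, local_uri, archive, outbox):
    """Archive locally or queue for transfer, depending on the target URI."""
    target = classifier["target_repository"]
    if target == local_uri:
        archive.append(content)         # arrow 256: store in the local archive
        return "archived"
    outbox.append((target, content))    # arrow 244: hand to the web interface
    return "forwarded"

LOCAL = "http://example.invalid/repositories/bob"    # hypothetical URIs
REMOTE = "http://example.invalid/repositories/alice"
archive, outbox = [], []

first = route("doc A", {"target_repository": LOCAL}, LOCAL, archive, outbox)
second = route("doc B", {"target_repository": REMOTE}, LOCAL, archive, outbox)
```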
  • When new content arrives at document manager 238, the content is indexed and stored in archive 254 in the manner discussed above. This arrangement provides a group information discovery mechanism. When a first user finds content that would interest a second user, the content is automatically placed in the second user's repository.
  • In order to alert the second user that relevant content has been identified, the document manager that identifies the content can alert the second user. For example, as illustrated in FIG. 8, upon identifying content that matches a classifier submitted by repository 800, document manager 826 can instruct notifier 828 as indicated by arrow 834 to send a notification to repository 800 via web interface 816. Web interface 816 sends the notification as indicated schematically by arrow 836, Internet 812 and arrow 838 to web interface 808. Alternatively, document manager 826 can post the notification to a public distribution area, such as conventional syndication feed 830. Repository 800 can then poll the syndication feed 830, via web interfaces 808 and 816, to discover when new content has been added to repository 800.
  • In another embodiment, instead of actually transferring content between repositories, a first user associated with a first repository could request that a second repository associated with a second user simply notify the first repository whenever the second repository adds new resource content to a particular category in the second repository. This could be accomplished by the first user submitting a classifier to the second repository, which classifier identifies content for the category. In this manner the first user is notified about anything that the second user finds interesting in a particular category.
  • Similarly, in yet another embodiment the inventive system can be used in situations where a user works on materials that, for security reasons, he or she cannot allow anyone else to see. This user may have spent a great deal of effort on the construction of a classifier for a particular class of documents. He could decide to share his classifier for this class of documents without having to share the documents from which the classifier was constructed. This can be accomplished by making a copy of the classifier and changing the target repository identifier to identify a repository with which the user wishes to share information. The user then instructs the document manager to notify the other repository when relevant information is identified, but not to forward the content to the other repository. Alternatively, a user can simply forward a classifier that he or she has constructed to another repository for use in that repository. Further, a user can also forward an entire classifier tree, or trees, to another user to enable that other user to classify content in the same manner as the first user.
  • A software implementation of the above-described embodiment may comprise a series of computer instructions either fixed on a tangible medium, such as a computer readable medium, for example, a diskette, a CD-ROM, a ROM memory, or a fixed disk, or transmittable to a computer system, via a modem or other interface device over a medium. The medium can either be a tangible medium, including but not limited to optical or analog communications lines, or be implemented with wireless techniques, including but not limited to microwave, infrared or other transmission techniques. It may also be the Internet. The series of computer instructions embodies all or part of the functionality previously described herein with respect to the invention. Those skilled in the art will appreciate that such computer instructions can be written in a number of programming languages for use with many computer architectures or operating systems. Further, such instructions may be stored using any memory technology, present or future, including, but not limited to, semiconductor, magnetic, optical or other memory devices, or transmitted using any communications technology, present or future, including but not limited to optical, infrared, microwave, or other transmission technologies. It is contemplated that such a computer program product may be distributed as a removable medium with accompanying printed or electronic documentation, e.g., shrink wrapped software, pre-loaded with a computer system, e.g., on system ROM or fixed disk, or distributed from a server or electronic bulletin board over a network, e.g., the Internet or World Wide Web.
  • Although an exemplary embodiment of the invention has been disclosed, it will be apparent to those skilled in the art that various changes and modifications can be made which will achieve some of the advantages of the invention without departing from the spirit and scope of the invention. For example, it will be obvious to those reasonably skilled in the art that, in other implementations, the inventive method and apparatus can be used with any conventional clustering algorithms and algorithms for generating classifiers. The order of the process steps may also be changed without affecting the operation of the invention. Other aspects, such as the specific process flow, as well as other modifications to the inventive concept are intended to be covered by the appended claims.
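The routing and notification behavior described in the preceding paragraphs can be illustrated with a minimal Python sketch. It is not the implementation disclosed in this application: the names Classifier, Repository, submit_classifier, add_resource, receive and the notify_only flag are hypothetical, the in-memory network dictionary stands in for the web interfaces and URIs, the example URIs are placeholders, and a simple substring test stands in for a real classifier. As a simplification, content received from another repository is only stored and is not re-checked against classifiers.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Classifier:
    """Hypothetical classifier record carrying a target repository identifier."""
    name: str
    matches: Callable[[str], bool]  # stand-in for a real classification model
    target_uri: str                 # repository that should receive matching content
    notify_only: bool = False       # if True, notify the target but do not forward content


@dataclass
class Repository:
    """Hypothetical repository; `network` maps URIs to repositories (assumption)."""
    uri: str
    network: Dict[str, "Repository"]
    classifiers: List[Classifier] = field(default_factory=list)
    archive: List[str] = field(default_factory=list)
    notifications: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Register this repository under its URI so others can reach it.
        self.network[self.uri] = self

    def submit_classifier(self, classifier: Classifier) -> None:
        # Accept a classifier generated locally or submitted by another repository.
        self.classifiers.append(classifier)

    def receive(self, content: str) -> None:
        # Simplification: store incoming content without re-running classification.
        self.archive.append(content)

    def add_resource(self, content: str) -> None:
        # Store the resource locally, then check it against every known classifier.
        self.archive.append(content)
        for c in self.classifiers:
            if c.target_uri == self.uri or not c.matches(content):
                continue  # local target is already satisfied; non-matches are skipped
            target = self.network[c.target_uri]
            if c.notify_only:
                target.notifications.append(f"{self.uri}: new match for '{c.name}'")
            else:
                target.receive(content)


network: Dict[str, Repository] = {}
alice = Repository("http://example.org/alice", network)
bob = Repository("http://example.org/bob", network)
carol = Repository("http://example.org/carol", network)

# Bob submits a classifier to Alice's repository; matches are forwarded to Bob.
wants_search = Classifier("search", lambda text: "search" in text.lower(), bob.uri)
alice.submit_classifier(wants_search)

# A copy with a different target identifier, set to notify without forwarding content.
alice.submit_classifier(replace(wants_search, target_uri=carol.uri, notify_only=True))

alice.add_resource("Notes on distributed search")
alice.add_resource("Lunch menu")
```

In this sketch, Bob's archive ends up holding only the matching document, while Carol's repository holds a notification but no content, which corresponds to the embodiment in which a classifier is shared without sharing documents.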

Claims (20)

1. A method for searching and resource discovery in a distributed enterprise in which users create personal data repositories and generate classifiers that classify the information for those repositories, the method comprising:
(a) associating with each generated classifier a target repository into which a resource that matches that classifier is added;
(b) submitting a classifier that was created in one repository to at least one other repository;
(c) when a resource is added to a repository, checking the added resource against all classifiers that have been generated for that repository and against all classifiers submitted to that repository by other repositories; and
(d) when the added resource matches any classifier, adding the resource to a repository associated with the matching classifier.
2. The method of claim 1 further comprising:
(e) when a copy of a resource is received at a repository, indexing that resource copy.
3. The method of claim 1 wherein step (d) further comprises notifying a repository associated with the matching classifier that a resource has been added to that repository.
4. The method of claim 1 wherein step (d) further comprises posting a notification to a public distribution area when a resource has been added to the repository associated with the matching classifier so that repository can poll the public distribution area to determine when resources have been added to that repository.
5. The method of claim 1 wherein step (a) comprises generating a target repository identifier for a classifier and inserting the target repository identifier into that classifier.
6. The method of claim 5 wherein each repository is assigned a uniform resource identifier (URI) and the target repository identifier for a repository is the URI assigned to that repository.
7. The method of claim 1 wherein each repository publishes a web server that receives classifiers submitted by other repositories and wherein step (b) comprises submitting a classifier that was created in one repository to a web server published by at least one other repository.
8. The method of claim 1 wherein step (c) comprises generating a classifier tree for a repository from all classifiers that have been generated for that repository and for all classifiers submitted to that repository by other repositories and comparing the added resource to the classifier tree.
9. The method of claim 1 wherein in step (d) when the added resource matches any classifier, notifying the repository associated with the matching classifier instead of adding the resource to a repository associated with the matching classifier.
10. The method of claim 1 further comprising:
(e) associating with at least one generated classifier a target repository into which a copy of a resource that matches that classifier is placed wherein the target repository is a repository other than the repository that generated the classifier.
11. Apparatus for searching and resource discovery in a distributed enterprise in which users create personal data repositories and generate classifiers that classify the information for those repositories, the apparatus comprising:
a mechanism that associates with each generated classifier a target repository into which a resource that matches that classifier is added;
a mechanism that submits a classifier that was created in one repository to at least one other repository;
a mechanism operable when a resource is added to a repository, that checks the added resource against all classifiers that have been generated for that repository and against all classifiers submitted to that repository by other repositories; and
a mechanism operable when the added resource matches any classifier, that adds the resource to a repository associated with the matching classifier.
12. The apparatus of claim 11 further comprising a mechanism operable when a copy of a resource is received at a repository, that indexes that resource copy.
13. The apparatus of claim 11 wherein the mechanism that adds the resource to a repository further comprises a mechanism that notifies a repository associated with the matching classifier that a resource has been added to that repository.
14. The apparatus of claim 11 wherein the mechanism that adds the resource to a repository further comprises a mechanism that posts a notification to a public distribution area when a resource has been added to the repository associated with the matching classifier so that repository can poll the public distribution area to determine when resources have been added to that repository.
15. The apparatus of claim 11 wherein the mechanism that adds the resource to a repository comprises a mechanism that generates a target repository identifier for a classifier and inserts the target repository identifier into that classifier.
16. The apparatus of claim 15 wherein each repository is assigned a uniform resource identifier (URI) and the target repository identifier for a repository is the URI assigned to that repository.
17. The apparatus of claim 11 wherein each repository publishes a web server that receives classifiers submitted by other repositories and wherein the mechanism that submits a classifier that was created in one repository to at least one other repository comprises a mechanism that submits a classifier that was created in one repository to a web server published by at least one other repository.
18. The apparatus of claim 11 wherein the mechanism that checks the added resource against all classifiers that have been generated for that repository comprises a mechanism that generates a classifier tree for a repository from all classifiers that have been generated for that repository and for all classifiers submitted to that repository by other repositories and compares the added resource to the classifier tree.
19. The apparatus of claim 11 wherein the mechanism that adds the resource to a repository comprises a mechanism operable when the added resource matches any classifier, that notifies the repository associated with the matching classifier instead of adding the resource to a repository associated with the matching classifier.
20. Apparatus for searching and resource discovery in a distributed enterprise in which users create personal data repositories and generate classifiers that classify the information for those repositories, the apparatus comprising:
means for associating with each generated classifier a target repository into which a resource that matches that classifier is added;
means for submitting a classifier that was created in one repository to at least one other repository;
means, operable when a resource is added to a repository, for checking the added resource against all classifiers that have been generated for that repository and against all classifiers submitted to that repository by other repositories; and
means, operable when the added resource matches any classifier, for adding the resource to a repository associated with the matching classifier.
US11/477,021 2006-06-28 2006-06-28 Method and apparatus for searching and resource discovery in a distributed enterprise system Abandoned US20080005081A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US11/477,021 US20080005081A1 (en) 2006-06-28 2006-06-28 Method and apparatus for searching and resource discovery in a distributed enterprise system
US12/557,403 US7949660B2 (en) 2006-06-28 2009-09-10 Method and apparatus for searching and resource discovery in a distributed enterprise system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/477,021 US20080005081A1 (en) 2006-06-28 2006-06-28 Method and apparatus for searching and resource discovery in a distributed enterprise system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US12/557,403 Continuation US7949660B2 (en) 2006-06-28 2009-09-10 Method and apparatus for searching and resource discovery in a distributed enterprise system

Publications (1)

Publication Number Publication Date
US20080005081A1 (en) 2008-01-03

Family

Family ID: 38877942

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/477,021 Abandoned US20080005081A1 (en) 2006-06-28 2006-06-28 Method and apparatus for searching and resource discovery in a distributed enterprise system
US12/557,403 Active US7949660B2 (en) 2006-06-28 2009-09-10 Method and apparatus for searching and resource discovery in a distributed enterprise system

Family Applications After (1)

Application Number Title Priority Date Filing Date
US12/557,403 Active US7949660B2 (en) 2006-06-28 2009-09-10 Method and apparatus for searching and resource discovery in a distributed enterprise system

Country Status (1)

Country Link
US (2) US20080005081A1 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080147630A1 (en) * 2006-10-27 2008-06-19 Kaiyi Chu Recommender and payment methods for recruitment
US20110029527A1 (en) * 2009-07-28 2011-02-03 Knight William C System And Method For Displaying Relationships Between Electronically Stored Information To Provide Classification Suggestions Via Nearest Neighbor
US20110126093A1 (en) * 2006-11-06 2011-05-26 Microsoft Corporation Clipboard augmentation with references
US8341177B1 (en) * 2006-12-28 2012-12-25 Symantec Operating Corporation Automated dereferencing of electronic communications for archival
US8612446B2 (en) 2009-08-24 2013-12-17 Fti Consulting, Inc. System and method for generating a reference set for use during document review
US8751442B2 (en) 2007-02-12 2014-06-10 Microsoft Corporation Synchronization associated duplicate data resolution
US9203786B2 (en) 2006-06-16 2015-12-01 Microsoft Technology Licensing, Llc Data synchronization and sharing relationships
US11068546B2 (en) 2016-06-02 2021-07-20 Nuix North America Inc. Computer-implemented system and method for analyzing clusters of coded documents
US20210248429A1 (en) * 2016-09-16 2021-08-12 Technische Universitaet Dresden Method for classifying spectra of objects having complex information content

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7249034B2 (en) 2002-01-14 2007-07-24 International Business Machines Corporation System and method for publishing a person's affinities
US9137417B2 (en) 2005-03-24 2015-09-15 Kofax, Inc. Systems and methods for processing video data
US9769354B2 (en) 2005-03-24 2017-09-19 Kofax, Inc. Systems and methods of processing scanned data
US9576272B2 (en) 2009-02-10 2017-02-21 Kofax, Inc. Systems, methods and computer program products for determining document validity
US8774516B2 (en) 2009-02-10 2014-07-08 Kofax, Inc. Systems, methods and computer program products for determining document validity
US9767354B2 (en) 2009-02-10 2017-09-19 Kofax, Inc. Global geographic information retrieval, validation, and normalization
US9349046B2 (en) 2009-02-10 2016-05-24 Kofax, Inc. Smart optical input/output (I/O) extension for context-dependent workflows
US8958605B2 (en) 2009-02-10 2015-02-17 Kofax, Inc. Systems, methods and computer program products for determining document validity
US20120089774A1 (en) * 2010-10-12 2012-04-12 International Business Machines Corporation Method and system for mitigating adjacent track erasure in hard disk drives
US8572315B2 (en) 2010-11-05 2013-10-29 International Business Machines Corporation Smart optimization of tracks for cloud computing
US9058515B1 (en) 2012-01-12 2015-06-16 Kofax, Inc. Systems and methods for identification document processing and business workflow integration
US9058580B1 (en) * 2012-01-12 2015-06-16 Kofax, Inc. Systems and methods for identification document processing and business workflow integration
US8855375B2 (en) 2012-01-12 2014-10-07 Kofax, Inc. Systems and methods for mobile image capture and processing
US10146795B2 (en) 2012-01-12 2018-12-04 Kofax, Inc. Systems and methods for mobile image capture and processing
US9483794B2 (en) * 2012-01-12 2016-11-01 Kofax, Inc. Systems and methods for identification document processing and business workflow integration
US9449004B2 (en) * 2012-03-15 2016-09-20 Sap Se File repository abstraction layer
JP2016517587A (en) 2013-03-13 2016-06-16 コファックス, インコーポレイテッド Classification of objects in digital images captured using mobile devices
US9208536B2 (en) 2013-09-27 2015-12-08 Kofax, Inc. Systems and methods for three dimensional geometric reconstruction of captured image data
US9355312B2 (en) 2013-03-13 2016-05-31 Kofax, Inc. Systems and methods for classifying objects in digital images captured using mobile devices
US20140316841A1 (en) 2013-04-23 2014-10-23 Kofax, Inc. Location-based workflows and services
JP2016518790A (en) 2013-05-03 2016-06-23 コファックス, インコーポレイテッド System and method for detecting and classifying objects in video captured using a mobile device
WO2015073920A1 (en) 2013-11-15 2015-05-21 Kofax, Inc. Systems and methods for generating composite images of long documents using mobile video data
JP2017514211A (en) * 2014-03-19 2017-06-01 コファックス, インコーポレイテッド System and method for identification document processing and business workflow integration
WO2016032500A1 (en) 2014-08-29 2016-03-03 Hewlett Packard Enterprise Development Lp Resource trees by management controller
US9820138B2 (en) * 2014-10-22 2017-11-14 At&T Intellectual Property I, L.P. Method and apparatus for resource management in a communication system
US9760788B2 (en) 2014-10-30 2017-09-12 Kofax, Inc. Mobile document detection and orientation based on reference object characteristics
US10242285B2 (en) 2015-07-20 2019-03-26 Kofax, Inc. Iterative recognition-guided thresholding and data extraction
US9779296B1 (en) 2016-04-01 2017-10-03 Kofax, Inc. Content-based detection and three dimensional geometric reconstruction of objects in image and video data
US11681942B2 (en) 2016-10-27 2023-06-20 Dropbox, Inc. Providing intelligent file name suggestions
US9852377B1 (en) * 2016-11-10 2017-12-26 Dropbox, Inc. Providing intelligent storage location suggestions
US10803350B2 (en) 2017-11-30 2020-10-13 Kofax, Inc. Object detection and image cropping using a multi-detector approach

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040002967A1 (en) * 2002-03-28 2004-01-01 Rosenblum David S. Method and apparatus for implementing query-response interactions in a publish-subscribe network
US20050086205A1 (en) * 2003-10-15 2005-04-21 Xerox Corporation System and method for performing electronic information retrieval using keywords
US20050228774A1 (en) * 2004-04-12 2005-10-13 Christopher Ronnewinkel Content analysis using categorization
US20060265746A1 (en) * 2001-04-27 2006-11-23 Internet Security Systems, Inc. Method and system for managing computer security information
US7275063B2 (en) * 2002-07-16 2007-09-25 Horn Bruce L Computer system for automatic organization, indexing and viewing of information from multiple sources

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7185081B1 (en) * 1999-04-30 2007-02-27 Pmc-Sierra, Inc. Method and apparatus for programmable lexical packet classifier
US7181438B1 (en) * 1999-07-21 2007-02-20 Alberti Anemometer, Llc Database access system
US7249034B2 (en) * 2002-01-14 2007-07-24 International Business Machines Corporation System and method for publishing a person's affinities
US7958088B2 (en) * 2007-12-14 2011-06-07 Yahoo! Inc. Dynamic data reorganization to accommodate growth across replicated databases

Cited By (35)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9203786B2 (en) 2006-06-16 2015-12-01 Microsoft Technology Licensing, Llc Data synchronization and sharing relationships
US20080147630A1 (en) * 2006-10-27 2008-06-19 Kaiyi Chu Recommender and payment methods for recruitment
US10572582B2 (en) * 2006-11-06 2020-02-25 Microsoft Technology Licensing, Llc Clipboard augmentation with references
US20170329751A1 (en) * 2006-11-06 2017-11-16 Microsoft Technology Licensing, Llc Clipboard augmentation with references
US20110126093A1 (en) * 2006-11-06 2011-05-26 Microsoft Corporation Clipboard augmentation with references
US9747266B2 (en) * 2006-11-06 2017-08-29 Microsoft Technology Licensing, Llc Clipboard augmentation with references
US20130262972A1 (en) * 2006-11-06 2013-10-03 Microsoft Corporation Clipboard augmentation with references
US8341177B1 (en) * 2006-12-28 2012-12-25 Symantec Operating Corporation Automated dereferencing of electronic communications for archival
US8751442B2 (en) 2007-02-12 2014-06-10 Microsoft Corporation Synchronization associated duplicate data resolution
US9064008B2 (en) 2009-07-28 2015-06-23 Fti Consulting, Inc. Computer-implemented system and method for displaying visual classification suggestions for concepts
US9477751B2 (en) 2009-07-28 2016-10-25 Fti Consulting, Inc. System and method for displaying relationships between concepts to provide classification suggestions via injection
US8645378B2 (en) 2009-07-28 2014-02-04 Fti Consulting, Inc. System and method for displaying relationships between concepts to provide classification suggestions via nearest neighbor
US8700627B2 (en) 2009-07-28 2014-04-15 Fti Consulting, Inc. System and method for displaying relationships between concepts to provide classification suggestions via inclusion
US8713018B2 (en) 2009-07-28 2014-04-29 Fti Consulting, Inc. System and method for displaying relationships between electronically stored information to provide classification suggestions via inclusion
US20110029527A1 (en) * 2009-07-28 2011-02-03 Knight William C System And Method For Displaying Relationships Between Electronically Stored Information To Provide Classification Suggestions Via Nearest Neighbor
US8909647B2 (en) 2009-07-28 2014-12-09 Fti Consulting, Inc. System and method for providing classification suggestions using document injection
US8572084B2 (en) * 2009-07-28 2013-10-29 Fti Consulting, Inc. System and method for displaying relationships between electronically stored information to provide classification suggestions via nearest neighbor
US9165062B2 (en) 2009-07-28 2015-10-20 Fti Consulting, Inc. Computer-implemented system and method for visual document classification
US8515957B2 (en) 2009-07-28 2013-08-20 Fti Consulting, Inc. System and method for displaying relationships between electronically stored information to provide classification suggestions via injection
US10083396B2 (en) 2009-07-28 2018-09-25 Fti Consulting, Inc. Computer-implemented system and method for assigning concept classification suggestions
US9898526B2 (en) 2009-07-28 2018-02-20 Fti Consulting, Inc. Computer-implemented system and method for inclusion-based electronically stored information item cluster visual representation
US9336303B2 (en) 2009-07-28 2016-05-10 Fti Consulting, Inc. Computer-implemented system and method for providing visual suggestions for cluster classification
US8635223B2 (en) 2009-07-28 2014-01-21 Fti Consulting, Inc. System and method for providing a classification suggestion for electronically stored information
US20110029530A1 (en) * 2009-07-28 2011-02-03 Knight William C System And Method For Displaying Relationships Between Concepts To Provide Classification Suggestions Via Injection
US9542483B2 (en) 2009-07-28 2017-01-10 Fti Consulting, Inc. Computer-implemented system and method for visually suggesting classification for inclusion-based cluster spines
US9679049B2 (en) 2009-07-28 2017-06-13 Fti Consulting, Inc. System and method for providing visual suggestions for document classification via injection
US8515958B2 (en) 2009-07-28 2013-08-20 Fti Consulting, Inc. System and method for providing a classification suggestion for concepts
US9489446B2 (en) 2009-08-24 2016-11-08 Fti Consulting, Inc. Computer-implemented system and method for generating a training set for use during document review
US9336496B2 (en) 2009-08-24 2016-05-10 Fti Consulting, Inc. Computer-implemented system and method for generating a reference set via clustering
US9275344B2 (en) 2009-08-24 2016-03-01 Fti Consulting, Inc. Computer-implemented system and method for generating a reference set via seed documents
US10332007B2 (en) 2009-08-24 2019-06-25 Nuix North America Inc. Computer-implemented system and method for generating document training sets
US8612446B2 (en) 2009-08-24 2013-12-17 Fti Consulting, Inc. System and method for generating a reference set for use during document review
US11068546B2 (en) 2016-06-02 2021-07-20 Nuix North America Inc. Computer-implemented system and method for analyzing clusters of coded documents
US20210248429A1 (en) * 2016-09-16 2021-08-12 Technische Universitaet Dresden Method for classifying spectra of objects having complex information content
US11879778B2 (en) * 2016-09-16 2024-01-23 Technische Universität Dresden Method for classifying spectra of objects having complex information content

Also Published As

Publication number Publication date
US20090327250A1 (en) 2009-12-31
US7949660B2 (en) 2011-05-24

Similar Documents

Publication Publication Date Title
US7949660B2 (en) Method and apparatus for searching and resource discovery in a distributed enterprise system
US11809432B2 (en) Knowledge gathering system based on user's affinity
US8688673B2 (en) System for communication and collaboration
US7809710B2 (en) System and method for extracting content for submission to a search engine
EP1018086B1 (en) Search system and method based on multiple ontologies
US6260041B1 (en) Apparatus and method of implementing fast internet real-time search technology (first)
US20040187075A1 (en) Document management apparatus, system and method
WO2001027793A2 (en) Indexing a network with agents
EP1428138A2 (en) Indexing a network with agents
US20050086252A1 (en) Method and apparatus for creating an information security policy based on a pre-configured template
JP2010539589A (en) Identifying information related to specific entities from electronic sources
US20100017388A1 (en) Systems and methods for performing a multi-step constrained search
CN113297457B (en) High-precision intelligent information resource pushing system and pushing method
WO2001027805A2 (en) Index cards on network hosts for searching, rating, and ranking
US8661069B1 (en) Predictive-based clustering with representative redirect targets
US7836108B1 (en) Clustering by previous representative
JP2003186888A (en) Component information classification device, component information search device, and component information search server
Tarakeswar et al. Search engines: a study
EP1929410B1 (en) A method and system for searching for people or items by keywords
Zhang et al. Web taxonomy integration using spectral graph transducer
Aluja et al. Enhancing Socioeconomic Surveys by Data about Internet Usage
HK1029412B (en) Search system and method based on multiple ontologies
WO2001035281A1 (en) Content engine

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION