Dental Implant Technology Clustering and Technology Life-span Analysis Using Ontology-based Patent Intelligence

Master of Science Thesis in the Master Degree Programme, Biotechnology

SEAN LONG HOANG

Department of Chemical and Biological Engineering
Division of Applied Surface Chemistry
CHALMERS UNIVERSITY OF TECHNOLOGY
Göteborg, Sweden, 2012

© SEAN LONG HOANG, 2012

Department of Applied Surface Chemistry, Chalmers University of Technology, SE-412 96 Göteborg, Sweden

Supervisor: Dr. Charles V. Trappey, Department of Management Science, National Chiao Tung University, 300, Hsinchu, Taiwan

Examiner: Professor Krister Holmberg, Department of Chemical and Biological Engineering, Division of Applied Surface Chemistry, SE-412 96 Göteborg, Sweden

ABSTRACT

Rapid technology development, shorter product life cycles, and fierce competition in the marketplace have established patent analysis as an important strategic tool for R&D management. This thesis develops a technology clustering and life-span analysis framework based on data mining techniques to help companies effectively and rapidly gain domain-specific knowledge and technology insight. Because patent documents contain complex terminology, patent analysis normally requires domain experts. This research applies patent analysis methodologies to create domain-specific ontologies.
The advantage of using an ontology is that it contains specific domain concepts and helps researchers to understand the relationships between concepts. In addition, ontologies are used to effectively extract domain knowledge, cluster patents, and create graphs for trend recognition. Life-span analysis of technology clusters helps companies to gain a quick snapshot of their own patent portfolio and identify potential technology clusters for investment. This thesis proposes a process of knowledge extraction for domain-specific patents using patent analysis methodologies that improve the understanding of domain knowledge. The methodologies proposed in this research include key phrase analysis, patent technology clustering, patent document clustering, domain-specific ontology, and life-span analysis. With these methodologies, companies can quickly derive domain-specific ontologies that help R&D engineers relate data and increase their understanding of a specific domain and of the relationships between its concepts. Life-span analysis helps companies direct strategic R&D plans and evaluate the timing of investments. The validity and reliability of the methodology are tested by applying it to a set of dental implant patents.

Keywords: Dental implant, ontology, key phrase analysis, clustering, life-span analysis

ACKNOWLEDGEMENTS

I would like to thank my advisor, Dr. Charles V. Trappey, for his guidance and advice and for helping me to excel to a higher level. I deeply appreciate his sharing of knowledge and experience, and his fun stories were always enjoyable. I would also like to thank a very kind person, Dr. Chun-Yi Wu, for his guidance, time, and effort throughout my time working on this thesis. Without the IPDSS software developed by Dr. Wu, this thesis would not have been possible. I would also like to thank my family and friends!
My time in Taiwan has been wonderful thanks to your kindness and hospitality.

“We make a living by what we get, but we make a life by what we give”
By Sir Winston Churchill

Sean Long Hoang
Chalmers University of Technology
Hsinchu, Taiwan
December 2011

TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
1. INTRODUCTION
  1.1. Research background
  1.2. Research motivation
  1.3. Research procedure
  1.4. Research objectives
  1.5. Research limits
2. BACKGROUND
  2.1. Dental implant
  2.2. Data mining
    2.2.1. Text mining
    2.2.2. Limitations in text mining
  2.3. Ontology
  2.4. Patent analysis
    2.4.1. Data in patent documents
    2.4.2. Limitations in patent analysis
  2.5. Patent clustering
  2.6. Key phrase analysis
    2.6.1. Term frequency approach
    2.6.2. Key phrase correlation matrix
    2.6.3. Key phrase and patent correlation matrix
    2.6.4. Limitations in key phrases extraction of textual documents
  2.7. Technology life cycle analysis
  2.8. Research framework
3. METHODOLOGY
  3.1. Patent domain definition
  3.2. Key phrase analysis process
  3.4. Processing domain-specific ontologies
  3.5. Processing life-span analysis in patent clusters
4. CASE STUDY AND ANALYSIS
  4.1. Dental patent documents samples
  4.2. Dental implant technology key phrase analysis processing results
  4.3. Dental implant technology key phrases
  4.4. Dental technology ontology
  4.5. Ontology-based technology clustering of dental implant patents
5. DISCUSSION AND CONCLUSION
  5.1. Discussion
  5.2. Conclusion
  5.3. Future research suggestions
REFERENCES
APPENDIX 1 Key phrase and patent correlation matrix
APPENDIX 2 Sub-domain key phrase matrix
APPENDIX 3 Validation and modification of ontology

LIST OF TABLES

Table 1. Description of a typical patent analysis scenario
Table 2. Definition of patent classifications for dental implants
Table 3. Key phrases and patent correlation matrix
Table 4. Key phrases correlation matrix
Table 5. Key phrases and patent correlation matrix
Table 6. Methodology outline of this research
Table 7. Key phrases and patent correlation matrix
Table 8. Key phrases correlation matrix
Table 9. Patent and UPC matrix
Table 10. Key phrase and UPC matrix
Table 11. Key phrase organization matrix
Table 12. Key phrases and patent correlation matrix
Table 13. Key phrases of each cluster
Table 14. Patent information and average age
Table 15. List of patents in each classification and each dimension
Table 16. Part of dental implant key phrase and patent correlation matrix
Table 17. Part of dental implant sub-domain key phrase matrix
Table 18. Part of key phrase matrix for each dimension
Table 19. Key phrases for improvement of implant ontology
Table 20. List of phrases for ontological sub-domains of test patents
Table 21. Patent information and dental implants patents over the years
Table 22. Implant assembly sub-domain patent information
Table 23. Screw device sub-domain patent information
Table 24. Implant fixture sub-domain patent information

LIST OF FIGURES

Figure 1. Research procedure and framework
Figure 2. The construction of dental implants
Figure 3. Overview of data mining
Figure 4. An example ontology tree: RFID technology
Figure 5. General process of clustering data
Figure 6. S-curve of TLC stages
Figure 7. Login page of IPDSS
Figure 8. Workspace of IPDSS
Figure 9. Processing of patent domain definition
Figure 10. The key phrase analysis process
Figure 11. Processing of ontology
Figure 12. Processing TYPE II ontologies
Figure 13. Improvement steps of domain-specific ontologies
Figure 14. Processing of life-span analysis
Figure 15. Proposed life-span analyses of dental implant patent clusters
Figure 16. Dental implant ontology tree
Figure 17. Dental technology ontology tree
Figure 18. Life-span of dental implant clusters without expired patents
Figure 19. Life-span of dental implant clusters including expired patents
Figure 20. Life-span comparisons of dental implants over the years
Figure 21. Process for building a domain-specific ontology

1. INTRODUCTION

This chapter describes the value of patent analysis and the importance of ontologies. The background, motivations, procedures, and objectives of this research are also discussed.

1.1.
Research background

The global dental equipment and product market is estimated to grow at a Compound Annual Growth Rate (CAGR) of 7%, reaching US$27.6 billion by 2015 (Salisonline 2011). Some countries face challenges in the dental device industry because they lack sufficient expertise and skills, lack public and private funding (especially in the early phases, when innovation is developing at a high level of risk), and lack a national priority or strategy for the developing sector (University of Ottawa 2011). Other issues are litigation, government regulations and, most importantly, Intellectual Property (IP) management of innovative technologies as well as existing core technologies (University of Ottawa 2011). It is important to maintain a high-quality patent portfolio in order to quickly create innovative products, advance innovative technology development, protect innovations, and avoid litigation (Trappey et al. 2011).

Traditional patent analysis requires time, effort, and expertise to interpret the research results. Recent patent analysis techniques use data mining as a tool to extract information from large volumes of data, which decreases the time and effort needed for analyzing patents (Lee et al. 2009). These techniques apply statistical analysis to automatically perform key phrase analysis, document clustering, life cycle analysis, and so on.

High technology companies strive to orient R&D and strategic plans towards emerging technologies. Patent documents are available through databases and are rich sources of information that provide a foundation for technology trend analysis. Patent documents in specific technology domains contain domain-specific terms, so patent analysis requires domain experts and their experience (Jun & Uhm 2010). This limits the opportunity for researchers or R&D engineers to explore and understand patent information using data mining techniques. However, according to Wanner et al.
(2008), the use of data mining techniques should include an ontology. An ontology can be seen as an organized hierarchical structure with abstracted domain concepts and relations expressed in terms of domain terminology. The advantage of using an ontology is that it contains a specific domain corpus that helps readers understand the meaning of these terms. The ontology helps researchers and R&D engineers relate the data, increase their understanding of a specific domain, and understand the relationships between domain concepts.

The current life cycle stage of a technology influences companies when they invest R&D capital in it (Haupt et al. 2007). A life cycle can be divided into four stages: introduction, growth, maturity, and decline. The introduction stage of a new technology often involves technical or fundamental scientific problems and is associated with high-risk investment of R&D capital. A technology at the mature stage requires companies to evaluate the boundaries of patents to avoid infringement issues, trade IP, or license IP.

1.2. Research motivation

A company must meet the market demand for new products. Designs that maintain market position consistently require advances in technology and innovation. Companies with a strong patent portfolio use intellectual property as leverage in the marketplace to gain competitive advantage. In addition, companies face increased competition and budget constraints when trying to allocate resources effectively to specific technology areas for research and development. Rapid advances in technology and shorter product life cycles leave companies little time to plan R&D activities strategically. Moreover, patent documents contain complex terminology, and because R&D engineers often do not have the skills to perform patent analysis, companies must hire expensive experts who need considerable time for the analysis.
The main advantage of using ontologies is that they contain domain-specific knowledge and concepts, which enable R&D engineers to gain valuable insight into the domain and to understand the relationships between its concepts. Companies need help to direct R&D plans and evaluate the timing of future technology investments. A life-span analysis can be used to gain a quick snapshot of the technology clusters that are candidates for investment.

Patent analysis helps companies to target R&D plans towards recent technology trends and identify future R&D plans (Trappey et al. 2011). Companies that hold a strong patent portfolio and conduct strategic patenting activities are more successful than companies that remain inactive, for example in mechanical engineering and the biotechnology sector (Ernst 1995; Austin 1993). Companies using patent analysis can observe technology development and identify potential competitors in the marketplace, since a patent grants the inventor exclusive rights, for a limited time period, to exclude anyone else from producing or using the specific device, apparatus, process, or design (Grilliches 1998). In addition, a patent, or the patentability of a technology, is one of the preconditions of the commercial potential of that technology. Moreover, patents can be used to generate life cycle analyses that monitor technology development and help companies identify potential R&D investment opportunities or strengthen their IP position (Haupt et al. 2007).

Patent analysis can be used to study a country's performance (Grilliches 1990), to help governments allocate resources to specific technology areas (Yoon & Yongtae 2007), and to monitor the technology development of competitors (Jun & Uhm 2010). Some companies avoid filing patents and patent only their most successful innovations. Subramanian & Soh (2010) explain that most high technology firms are in a patent race and that patents are considered to be linked to the firms' performance.

1.3.
Research procedure

The research procedure is divided into six phases:

• Phase 1: Motive and objectives
• Phase 2: Literature review
• Phase 3: Selection of research direction. The direction of this research is selected according to the predefined objectives and the literature review. The selected areas of development are data mining, key phrase analysis, technology clustering, patent document clustering, and technology life-span analysis.
• Phase 4: Methodology development
• Phase 5: Case study
• Phase 6: Evaluation

The research procedure and framework are shown in Figure 1.

Figure 1. Research procedure and framework

1.4. Research objectives

The goal of this research is the development of a domain-specific ontology framework, which will be used for technology clustering and life-span analysis of clusters. This research intends to:

1. Synthesize methodologies that generate the information needed to construct a domain-specific patent ontology structure and procedure based on dental implant patents
2. Develop a domain-specific patent ontology framework based on patent classifications to gain domain-specific knowledge and the relationships between concepts
3. Develop a procedure for ontological sub-domain technology clustering and technology life-span analysis

1.5. Research limits

This research is subject to the following limitations. The patent data come only from the United States Patent and Trademark Office (USPTO). The ontologies are built from this USPTO patent data, from the researcher's experience and knowledge of the dental implant domain, and with the help of a dictionary (WordNet) to link the concepts in the ontology.

2.
BACKGROUND

This chapter has eight sections, which discuss dental implants, data mining, ontology, patent analysis, patent clustering, key phrase analysis, technology life cycle analysis, and the research framework. The background covers the foundation methodologies (data mining, key phrase analysis, patent technology clustering, patent document clustering, and technology life cycle analysis) used to develop the methodology for creating a domain-specific patent ontology.

2.1. Dental implant

The largest segment of the global market is restorative and preventive dentistry, but the fastest growing segment is dental implantology. Dental implants have a market share of 18% of the global dental device market (Markets and Markets 2010). The global dental equipment and product market is experiencing steady growth due to a number of factors, such as demand for dental implants, the desire for aesthetics, and increased use of preventive dental care (Brocair Partners 2010; Salisonline 2011). The popularity of cosmetic dental treatments and implants, along with increasing demand for better dental care, will drive growth in new, innovative dental implant technologies and products (Salisonline 2011).

Currently, millions of people benefit from dental implants, which are the ideal option for people who have lost a tooth or teeth due to gum disease, an injury, or some other reason. A dental implant is an artificial tooth root that is placed into the jaw and holds a replacement tooth. Dental implants are the only option for replacing missing teeth that closely resembles a natural tooth, behaves like a real root, and bonds naturally to the jaw bone. The crown is then bonded to the top of the dental implant.

Paterson and Zamanian (2009) state that the global dental implant industry will experience strong growth from 2011 to 2015 and that emerging technologies will improve the efficiency of dental procedures and reduce the time they take.
For example, Salisonline (2011) points out that 3D imaging techniques have improved patient diagnosis and procedure planning. Other emerging technologies in the dental implant and biotechnology industries are dental biomaterials and tissue regenerative materials, which offer a more natural and long-term solution. In the U.S. and EU markets, these trends are shifting customers' needs towards cosmetic dentistry and driving the dental implant market towards high-end dental solutions and products (Salisonline 2011).

A dental implant is defined as an implant that replaces a natural tooth (WordNet Princeton University, 2011). The main components of a dental implant include a screw that connects with a custom-made crown. Figure 2 shows the main components of a dental implant compared with a natural tooth.

Figure 2. The construction of dental implants (Puja Dental Group 2011).

The components of dental implants include the implant body, cover screw (prevents access to the bone), transmucosal abutment (links the implant body to the mouth), healing abutment (temporarily placed on the implant to maintain patency of the mucosal penetration), healing caps (temporary covers for abutments), crowns, bridges, gold cylinder (fits an abutment and forms part of the prosthesis), laboratory analogue (a base metal replica of the implant or abutment), etc. (AstraTech Dental 2011; Free Dental Implant 2011).

2.2. Data mining

Data mining has recently become a popular way of extracting information from databases. It applies machine learning and statistical analysis techniques to access and extract information from databases with large volumes of patent documents (Lee et al. 2009). It automatically discovers patterns in databases and is useful for mapping scientific and technical information for complex analysis of large volumes of information (Kim et al. 2008). Fayyad et al.
(1996) state that methods and techniques must be developed to interpret the data so that it makes sense to humans. Data volumes are growing rapidly, both in the number of records (objects) and in the number of fields per record; in the medical field, for example, the data can easily be divided into hundreds of different fields (Fayyad et al. 1996). Data are captured for different purposes: in business to gain competitive advantage, or in environmental studies to better understand effects. Data mining has been applied in marketing, investment, fraud detection, manufacturing, and telecommunications. In marketing, for example, the primary application of data mining is to analyze a database to identify customer groups and forecast future buying patterns or behavior (Fayyad et al. 1996).

Furthermore, data are a set of facts, and a pattern is an expression in some language that describes a model of the data or a structure found in the data. The value of a pattern is generally judged by a combination of validity, novelty, usefulness, and simplicity. These conditions do not define “knowledge” as such; rather, they define a framework for recognizing patterns as knowledge. This framework is purely user-oriented and domain-specific, and its functions are determined by the user (Fayyad et al. 1996).

Fayyad et al. (1996) state that Knowledge Discovery in Databases (KDD) is the overall process of discovering knowledge from a database and that data mining is one step in that process. Historically, the activity has been given a variety of names, including data mining and text mining, but the concept of finding patterns in data is the same. Data mining uses algorithms to extract patterns from data, which is sometimes also referred to as information retrieval (Fayyad et al. 1996).
KDD involves additional steps, such as data preparation (storage and access), data selection, data cleaning, incorporation of appropriate prior knowledge, and logical interpretation of the results. These steps are important for ensuring that understandable knowledge is extracted from the data, while statistics provides the framework and language for the discovery of patterns (Fayyad et al. 1996). The overall data mining framework is shown in Figure 3.

Figure 3. Overview of data mining, adapted from Han J. et al. (2011)

2.2.1. Text mining

Text mining is a technique developed from data mining to analyze textual data, especially unstructured textual documents (free text, abstracts, etc.) such as patent documents (Lee et al. 2009; Kostoff et al. 2009). Text mining labels each document and links the documents to specific words, which allows discovery to be based on the labels (Lee et al. 2009). A text document is ambiguous and, according to Nasukawa and Nagano (2001), contains various types of rich information and represents factual information. Most information stored in a database is in the form of text documents. Text mining is used to automatically discover knowledge and patterns in a database by applying statistical algorithms (Weiss et al. 2005). Text mining is a broad field that involves information retrieval, text analysis, information extraction, clustering, categorization, visualization, machine learning, and data mining (Tan 2011; Lee et al. 2005).

Patent documents contain detailed information expressed in complex technical and legal terms that only experts in the specific field understand, which makes them difficult for non-specialists to read and analyze (Tseng et al. 2005). Patent documents are often lengthy, and traditional patent analysis is inefficient: it requires a long time and much human effort to analyze the contents, which also makes it highly expensive (Lee et al. 2009; Tseng et al. 2005).
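The idea described above, in which each document is linked to specific words so that discovery can operate on those labels, can be illustrated with a small inverted index. This is only a minimal sketch and not the method used by the IPDSS software: the function names (`tokenize`, `build_index`, `search`), the stop-word list, and the three toy abstracts are all hypothetical and chosen purely for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; a real system would use a much fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "for", "with", "that"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def build_index(documents):
    """Map each word (label) to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, *words):
    """Return the ids of documents that carry every one of the given labels."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()


# Toy abstracts, invented for illustration only (not real patent text).
documents = {
    "P1": "A dental implant with a threaded screw body.",
    "P2": "An abutment that links the implant body to the crown.",
    "P3": "A healing cap for a temporary abutment.",
}
index = build_index(documents)
```

Calling `search(index, "implant", "body")` then returns the documents that share both labels. Single-word labels like these are the simplest possible choice; they do not capture compound terms or synonyms.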
On August 16, 2011, the United States Patent and Trademark Office (USPTO) issued patent number 8,000,000, and according to USPTO statistics a couple of hundred thousand patent applications are pending examination each year (USPTO 2011). The accumulation of patent documents at the USPTO has increased at a striking pace due to more patent applications and granted patents (USPTO 2011). There is therefore great demand for automated data mining techniques that extract information from the rapidly growing volume of data into a compact form that is easier to absorb. Large text databases such as the USPTO database potentially contain a great amount of information (knowledge), but only if it can be interpreted. The traditional method of turning raw data into knowledge requires manual analysis and a huge amount of reading and organizing of the content, so a heavy workload is needed to analyze only a tiny fraction of the database (Lee et al. 2009). Furthermore, it is expensive, and the analysis provided by the analysts is very subjective (Fayyad et al. 1996).

Recently, text mining has attracted researchers to apply it to patent analysis (Kim et al. 2008; Yoon and Park 2004). For example, Tseng et al. (2007) used text mining techniques to create a patent map for the technology domain of carbon nanotubes. Tseng et al. (2005) applied text mining techniques to automatically create important categorization features that might be as good as human-derived ones or even better. One of the advantages of using text mining techniques in patent analysis is that they can handle large volumes of patent documents and extract useful information (Lee et al. 2009). Since patent documents are lengthy but contain significant technical information, automatic text mining will assist researchers, engineers, and decision-makers in patent analysis (Lee et al. 2009).
However, extracted data has to meet specific quality criteria: it must be comprehensible for humans and must represent the concept of the text or benefit the user (Yoon and Park 2004). Furthermore, text mining techniques have been applied to summarization, term association, cluster generation, topic identification, information mapping, technology trend analysis, automatic patent classification, and so on (Lee et al. 2009; Yoon and Park 2004; Tseng et al. 2007).

2.2.2. Limitations in text mining

Although text mining is a promising technique for analyzing textual data, it has limitations, particularly in areas that require accuracy; it is most useful for providing supportive information for analysis (Smith 2002). Text mining techniques face a significant challenge in dealing with patent documents because the algorithms cannot include compound words, which are difficult to determine, and cannot consider synonyms (Lee et al. 2009; Smith 2002). Furthermore, a large number of keywords is required for text mining to distinguish accurately between documents (Smith 2002). Additionally, when text mining is applied to unstructured data in patent documents, it can be difficult to distinguish text or keywords that describe "prior art" from text that describes the invention (Smith 2002). This is important because the description addresses the technical characteristics of the patented invention. Despite these limitations, advances in computer science and better text mining algorithms are expected to strengthen the advantages of text mining and make it more efficient and accurate.

2.3. Ontology

Huang et al. (2008) describe an ontology as a model that contains the concepts, links, and relationships in a specific domain and reflects the reality of the world.
WordNet from Princeton University (2011) defines ontology as "a rigorous and exhaustive organization of some knowledge domain that is usually hierarchical and contains all the relevant entities and their relations". An ontology provides a unified knowledge base, expressed in the information domain, that integrates information from various sources (Taduri et al. 2011). For example, a company that needs to collect information on a specific technology and has only initial knowledge can use a domain-specific ontology to collect relevant information much faster than with existing systems (Taduri et al. 2011). The ontology contains domain concepts and relations that can be reused, modified, and shared among R&D engineers (Soo et al. 2006). An ontology provides an organized framework with a hierarchical structure and the relationships of the domain, which makes it possible to understand the relations between concepts (Rubin et al. 2007). An ontology links the semantic data between concepts, which makes it possible to perform pattern recognition, similarity analysis, and clustering of patent documents with respect to their content (Wanner et al. 2008). A variety of methods have been proposed to create knowledge domains, and one of them suggests a single ontology that integrates all knowledge domains (Taduri et al. 2011). The potential drawback of this method is that it creates a very large set of knowledge domains which, depending on the application, may be unnecessary and inefficient (Lau et al. 2011). Alternative ontology architectures propose separate ontologies that are domain-specific and application-specific (Noy & McGuiness 2001). Many methodologies have been proposed and are used to create ontologies that capture domain knowledge of patents to enhance information retrieval (Taduri et al. 2011).
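To make the idea of a hierarchical, reusable concept structure concrete, the following minimal sketch stores an ontology as a parent-to-children mapping and answers two simple questions: which narrower concepts fall under a concept, and which broader concepts lie above it. The node names are borrowed from the RFID example of Trappey et al. (2010) reproduced as Figure 4 in this chapter, but the arrangement of the nodes, the variable `ONTOLOGY`, and both function names are assumptions made only for illustration.

```python
from collections import deque

# Hypothetical ontology fragment: each concept maps to its narrower concepts.
# The hierarchy shown here is illustrative, not a reproduction of any published tree.
ONTOLOGY = {
    "RFID": ["RFID device", "RFID application"],
    "RFID device": ["Reader", "Tag", "Antenna"],
    "Tag": ["Active", "Passive"],
    "RFID application": ["Tracking", "Identification"],
    "Tracking": ["Asset tracking", "Animal tracking"],
}


def descendants(concept):
    """Return every concept below `concept`, in breadth-first order."""
    found = []
    queue = deque(ONTOLOGY.get(concept, []))
    while queue:
        node = queue.popleft()
        found.append(node)
        queue.extend(ONTOLOGY.get(node, []))
    return found


def path_to_root(concept):
    """Return the chain of broader concepts from `concept` up to the root."""
    parent_of = {child: parent
                 for parent, children in ONTOLOGY.items()
                 for child in children}
    path = [concept]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return path


print(descendants("RFID device"))
print(path_to_root("Passive"))
```

An engineer interested in a single sub-domain would query only that branch, for example `descendants("Tracking")`, and ignore the rest of the tree.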
Ontology-based patent intelligence

Gathering relevant patent information across different patent databases requires significant effort (Khelif et al. 2007). For example, a start-up company that wants to patent its technology in the field of dental implants needs to search patent databases and scientific publications and perform patent analysis for infringement purposes and competitor analysis (Taduri et al. 2011). Such a company faces the challenge of searching thoroughly for patent documents, and the large volume of patent documents makes this an almost impossible task (Lau et al. 2011). Patent documents in specific technology domains contain domain-specific terms that are not covered in common dictionaries; an advantage of an ontology is therefore that it contains the specific domain terminology (Soo et al. 2006). Trappey et al. (2010) point out that ontologies are useful for extracting the related concepts of key phrases. An ontology serves as an organized structure for arranging or classifying a domain. In addition, an ontology is a way of formally representing knowledge domains with concepts, their attributes, and the relations between terminologies, expressed in some well-defined logic (Rubin et al. 2007). According to Wanner et al. (2008), when text mining techniques such as key phrase extraction are used to represent the content of a patent document, the representation should also include an important feature: ontology. The goal of using a domain-specific ontology is to reduce conceptual and terminological confusion among R&D engineers (Navigli and Velardi, 2004). In addition, sharing the domain-specific ontology can improve communication and cooperation among people, enterprise organization, and system engineering (reusability, reliability, and specification) (Navigli and Velardi, 2004). The ontology approach to knowledge extraction has been applied in various fields. For example, Trappey et al.
(2009) applied an ontology tree for automatic patent document summarization, which extracts key information into a shortened abstract containing the key concepts of the patent document. The goal was to use the ontology to create a knowledge base for their software program in order to improve its architecture and its consistency in capturing the knowledge of their information system. Figure 4 shows the example RFID technology ontology tree from Trappey et al. (2010) that was used for patent document summarization. Rubin et al. (2007) describe how biomedical ontologies can help researchers accelerate their research as the amount of available biomedical information explodes. Biomedical ontologies help researchers structure complex biological domains and relate the data; for example, the Gene Ontology has gained huge attention in the biomedical community (Rubin et al. 2007).

Figure 4. An example ontology tree: RFID technology (Trappey et al. 2010). Node labels in the figure: RFID Ontology, RFID Device, Wired, Connector, Wireless, Reader, Memory, Frequency band, Tag, Standard, Unit, Value, Antenna, Active, Passive, Impedance, Directivity, Gain, Wave, Direction, Communication, Tolerance, Protocol, Security, Processor, Person, Plant, Parts, Creature, Item, Circuit, Animal, Encoding, Range, RFID application, Portable, Tracking, Interaction, Identification, Personal tracking, Asset tracking, Animal tracking, Inventory, Access, Distribution.

Creating a domain-specific patent ontology requires phrases that describe the concepts of the patent documents (Trappey et al. 2010). It requires identifying and defining the relevant concepts and relating them to a given application (Navigli and Velardi, 2004). The challenge in specific technical domains such as telecommunications, biotechnology, and biomedicine is that patent documents use specific technical or domain-specific terminology (Taduri et al. 2011). The terms are often a challenge because they appear in several forms, such as synonyms and hyponyms,
which makes general language comparison of patent documents inefficient. Taduri et al. (2011) and Mukherjea & Bamba (2007) point out the advantages of using an ontology: it captures the rich information available and allows the application to understand semantic associations, which avoids terminological inconsistencies. It also allows users to reason across the knowledge domain; because some applications require only small fragments of information, users can choose to work with only the information that is needed (Lau et al. 2011). For example, R&D engineers may be interested in only one technological sub-domain and ignore the other knowledge sub-domains.

2.4. Patent analysis

Patent documents contain rich, detailed information about research results expressed in complex technical and legal terms, and this information is valuable to the industry, business, law, and policy-making communities (Tseng et al. 2007; Choi et al. 2007). The detailed content of patent documents, if carefully analyzed, can reveal technology development, inspire novel technical solutions, show technical relations, or help investment policy (Tseng et al. 2005). Tseng et al. (2007) point out that patent analysis has become important even at the government level in some Asian countries such as China, Japan, Korea, Singapore, and Taiwan. These countries have invested various resources in creating visualized results of patent analysis (Lee et al. 2009). Patents are a useful vehicle for R&D and technology management research because they are a source of technical and commercial information that can be turned into knowledge (Choi et al. 2007; Lee et al. 2009). Patent documents are often lengthy, and interpreting the research results as a technology development analysis requires time, effort, and expertise. Tseng et al. (2007) also emphasize that patent analysts need a certain degree of expertise in information retrieval, domain-specific technologies, legal knowledge, and business intelligence.
A typical patent analysis scenario is shown in Table 1. Analysts who cover these multidisciplinary areas are hard to find or costly to train and maintain. Thus, automated technologies that assist patent analysis are in great demand.

Patent analysis can be divided into two levels: macro-level research, which covers national or industrial analysis, and micro-level research, which covers specific technology development or forecasting (Choi et al. 2007). Macro-level analysis evaluates the major economic effects of technological innovation, technological development, and the competitiveness of countries (Grilliches 1990). At the micro level, the focus is on identifying the technological development of specific areas or technologies, the advantages and disadvantages of competitors, and the strategic planning of R&D activities; patent data are analyzed to find the relation between companies and technologies (Haupt et al. 2006). Many researchers have tried to identify indicators or determinants of patent value (Sapsalis et al. 2006), using different types of patent data sets, such as data from regional patent offices, a particular sample, a specific sector such as biotechnology, or a particular company in a given country. Several researchers have studied patents and their relationship to, or effect on, the economy, technological innovation and development, or a country's competitiveness (Grilliches 1990). This is important because on average only 1-3 patents out of 100 generate significant financial returns. Although only a few patents achieve commercial success, most patents are developed through follow-up patenting into significantly important technologies (Ernst 1997). High-value patents often have broad technical claims and a high citation index, which increases the financial value of the company (Lerner 1995).
Companies that hold a strong patent portfolio and conduct strategic patenting activities are more successful than companies that remain inactive, as shown in the mechanical engineering and biotechnology sectors (Ernst 1995, 2001; Austin 1993). Companies can use patent analysis effectively to gain competitive advantages in the marketplace (Grilliches 1990). Moreover, patents are easily accessible throughout the world through databases in most countries (Lee et al. 2009). Before a patent is issued at the USPTO, each patent document is given one or several patent classifications based on the invention, claims, and content (Tseng et al. 2007). These classifications are denoted UPC (U.S. Patent Classification) and IPC (International Patent Classification) and are given in most patent documents. According to Tseng et al. (2007), patent classifications are sometimes too broad or cannot meet the requirements of a particular analysis. In this research, UPC and IPC are used for analysis, and examples of UPC and IPC definitions for dental implants are shown in Table 2.

Table 1. Description of a typical patent analysis scenario adapted from Tseng et al. (2007).

| Step | Description |
|---|---|
| 1. Task identification | Define the scope, concepts, and purposes for the analysis task |
| 2. Searching | Iteratively search, filter, and download related patents |
| 3. Segmentation | Segment, clean, normalize structured and unstructured parts |
| 4. Abstracting | Analyze the patent content to summarize their claims, topics, functions, or technologies |
| 5. Clustering | Group or classify analyzed patents based on some extracted attributes |
| 6. Visualization | Create technology-effect matrices or topic maps |
| 7. Interpretation | Predict technology or business trends and relations |

Source: Tseng et al. (2007).

Table 2. Definition of patent classifications for dental implants

| System | Code | Definition |
|---|---|---|
| International Patent Classification (IPC) | Class | Dentistry; Apparatus or methods for dental hygiene |
| IPC | A61C 8/00 | Means to be fixed to the jaw-bone for consolidating natural teeth or for fixing dental prostheses thereon; Dental implants; Implanting tools |
| IPC | A61C 13/00 | Dental prostheses; Making same (tooth crowns for capping teeth; dental implants) |
| U.S. Patent Classification (UPC) | Class 433 | Dentistry |
| UPC | Subclass 433/173 | By fastening to jawbone: This subclass is indented under subclass 172. Subject matter wherein the denture is secured directly to the jawbone of the patient. |
| UPC | Subclass 433/174 | By screw: This subclass is indented under subclass 173. Subject matter wherein the denture is secured to the jawbone by an elongated helically ribbed member |

Source: USPTO and WIPO (2011)

2.4.1. Data in patent documents

A patent document contains items that can be divided into two groups, structured and unstructured data (Tseng et al. 2005). Structured data are uniform across most patents and include the patent number, filing date, inventor, and assignee. Unstructured data are free text of varying length and content, such as claims, abstracts, or the description of the invention. Patent analysis using structured information such as filing dates, assignees, and citations has been in practice and in the literature for years (Ernst 1997; Lai & Wu 2005). The visualized results of structured data (patent number, filing date, etc.) are called patent graphs, and most use the bibliometric data of patent documents to provide statistical results for patent analysis (Lee et al. 2009). The visualized results of unstructured data (abstract, free text, etc.) are called patent maps. However, the general term patent map can be used to describe the results of both structured and unstructured data (Tseng et al. 2007). Patent maps are the visualization step in Table 1.
Patent maps can be used for decision-making about future R&D directions (understanding patent relations and how patents were invented in the past), for predicting technology and business trends (the trend of major competitors in the same industry), and for discovering technological trends and opportunities as well as technological holes for future innovations (Tseng et al. 2007; Choi et al. 2007). Bibliometrics is defined as the measurement of texts and information (Norton 2001). In general, most patent analyses use bibliometric data (structured data) to explore, organize, and analyze large amounts of data in order to identify patterns in authors, technology fields, citations, and so on (Daim et al. 2006). Although there are many items for analysis, one in particular has been employed more frequently: citation analysis. Patent citation analysis is defined as the count of citations of a patent in subsequent patents, so the number of citations per patent represents the relative importance of the patent (Lee et al. 2009). One possible reason, as Sapsalis et al. (2006) point out, is that citation analysis is closely associated with patent value (an increase in the financial value of a company). Analysis using bibliometric data is easy to understand and to create, but it gives only limited access to the richness of information in patent documents because it uses only bibliometric fields (Lee et al. 2009). Text mining has been proposed as an alternative for analyzing the unstructured textual data in patent documents (Kim et al. 2008).

2.4.2. Limitations in patent analysis

There are certain limitations to using patent analysis as an indicator for forecasting technology development or business trends. First, not every company or organization patents its inventions, and Choi et al.
(2007) mention, for example, that not all inventions meet the criteria set by patent offices, that some companies or industries rely on secrecy, and that not patenting an invention can be a strategic decision. Second, the results of patent analysis are interpreted differently across industries and companies, which leads to inconsistent analysis. Third, patent laws change over time, which makes analysis over time difficult, although recently companies have become more inclined to file patents, mainly to protect their inventions from competitors (Choi et al. 2007).

2.5. Patent clustering

Cluster analysis has recently become an important topic because of a decade of advancement in data mining, increased computing power, and a growing number of statistical software packages that include cluster analysis algorithms (Kettenring 2009). Given a set of documents, there is often a need to categorize the documents into groups or clusters. For a small set of documents this can be done manually, but for a large set of documents the process is time consuming and inefficient (Shahnaz et al. 2006). A patent document usually consists of a title, an abstract, claims, a detailed description of the invention, and bibliographic information. Moreover, all patent documents have manually assigned International Patent Classifications (IPCs), and documents issued at the USPTO also carry a United States Patent Classification (UPC). The classification codes, IPC and UPC, are assigned manually by patent specialists or examiners. This type of classification is called supervised because it uses predefined categories or topics (Shahnaz et al. 2006). Unsupervised classification often deals with unstructured data. The goal is to organize the unstructured data into groups or clusters based on the patterns of the collection itself (Dunham 2003). According to Trappey et al. (2010), patent documents with the same classification codes may be entirely different.
Clustering is an important data analysis technique that classifies patterns of key phrases into categories based on the characteristics of their relationships (Trappey et al. 2009). The main concept is to measure the similarity in the data, assign each item to the most suitable cluster, and maximize the similarity of specified variables within the same cluster; in other words, to create homogeneous clusters. Each patent document belonging to a cluster must be similar to the others in that cluster. According to Almeida et al. (2007), what matters is the presence of high connectivity among these patent documents, that is, a high association between objects. Clustering has been applied in numerous fields. For example, Taiwan Semiconductor Manufacturing Company, Ltd. uses clustering analysis to detect errors in the manufacturing process by isolating and separating failure symptoms and grouping suspicious process steps for evaluation by the process engineer (Kettenring 2009). It has also been used to predict consumer behavior by creating shopping clusters from consumers' previous purchasing behavior or patterns in order to forecast future shopping behavior (Kettenring 2009). A general clustering approach is shown in Figure 5.

Figure 5. General process of clustering data. Adapted from Trappey et al. (2009). Boxes in the figure: Data Collection, Data Conversion, Similarity Evaluation, Clustering, Analysis, Results.

Patent technology clustering

Patent technology clustering is a method for grouping similar or technology-related patent documents into clusters rather than grouping them by UPC (Trappey et al. 2010). It makes it possible to analyze the relationship between patent documents in a specific technology domain and to analyze patent trends and development (Trappey et al. 2008). Patent technology clustering is derived by using the key phrase correlation matrix as input and applying the K-means algorithm (Trappey et al. 2010; Trappey et al. 2009).
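As an illustration of this clustering step, the sketch below runs a plain K-means (Lloyd's algorithm) on a small key phrase correlation matrix and then computes the two indices, RMSSTD and RS, that this section describes for judging the number of clusters. It is a minimal sketch under stated assumptions, not the procedure implemented by Trappey et al.: the matrix values are invented, the starting centroids are chosen by hand so the run is reproducible, Euclidean distance is assumed, and the function names are hypothetical.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's K-means; `centroids` holds the caller's starting guesses."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each row joins its nearest centroid (Euclidean distance).
        labels = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == c]
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else old)
        if updated == centroids:  # centroids are stable, so stop
            break
        centroids = updated
    return labels, centroids


def sum_of_squares(rows):
    """Sum of squared deviations of every value from its column mean."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return sum((x - m) ** 2 for row in rows for x, m in zip(row, means))


def rmsstd_and_rs(points, labels):
    """RMSSTD: pooled within-cluster standard deviation (smaller is better).
    RS: between-cluster sum of squares / total sum of squares (larger is better)."""
    clusters = [[p for p, label in zip(points, labels) if label == c]
                for c in sorted(set(labels))]
    n_vars = len(points[0])
    ss_within = sum(sum_of_squares(c) for c in clusters)
    dof = sum(n_vars * (len(c) - 1) for c in clusters)
    ss_total = sum_of_squares(points)
    rmsstd = math.sqrt(ss_within / dof) if dof else 0.0
    rs = (ss_total - ss_within) / ss_total
    return rmsstd, rs


# Hypothetical 4 x 4 key phrase correlation matrix (one row per key phrase).
kp_corr = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
labels, _ = kmeans(kp_corr, centroids=[kp_corr[0], kp_corr[2]])
rmsstd, rs = rmsstd_and_rs(kp_corr, labels)
print(labels, round(rmsstd, 3), round(rs, 3))
```

In practice the two indices would be computed for several candidate numbers of clusters, and the count that combines a small RMSSTD with a large RS would be preferred.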
A more complete discussion of the K-means algorithm is provided by Han et al. (2011). Furthermore, Trappey et al. (2009) use the Root Mean Square Standard Deviation (RMSSTD) and R-Squared (RS) to find the optimal number of clusters in a set of data. RMSSTD is the pooled standard deviation of all variables and represents the variance within the same cluster; its value should therefore be as small as possible for optimal results. RS describes the variance between different clusters, and its value should be as large as possible, because RS is the sum of squares between different clusters divided by the total sum of squares for the set of data. More detailed descriptions of the equations and functions of RMSSTD, RS, and K-means are given by Trappey et al. (2009), Trappey et al. (2010), and Trappey et al. (2008).

Patent document clustering

Patent document clustering uses the correlation matrix generated from patent technology clustering as input to the K-means algorithm (Trappey et al. 2010). It is a method that measures the internal relationship of the key points of the patent documents and classifies the documents based on the similarity of their technologies (Taghaboni-Dutta et al. 2009; Trappey et al. 2010). As a result, it becomes easier for the patent analyst to analyze the characteristics of the patent documents in each cluster. It also solves the problem that patent classification systems (IPC and UPC) may place the same code on patent documents that are entirely different in technology (Taghaboni-Dutta et al. 2009). The matrix shown in Table 3 is used as the input for patent document clustering.

Table 3. Key phrases and patent correlation matrix (Trappey et al. 2010)

|  | Patent1 | Patent2 | Patent3 | … | Patentn |
|---|---|---|---|---|---|
| TC1 | N1,1 | N1,2 | N1,3 | … | N1N |
| TC2 | N2,1 | N2,2 | … | … | N2N |
| TC3 | N3,1 | … | … | … | N3N |
| … | … | … | … | … | … |
| TCn | … | … | … | … | Nnm |

Source: Trappey et al. (2010)

2.6. Key phrase analysis

Key phrase extraction is useful for document or information retrieval, document clustering, summarization, text mining, and so on (Matsuo and Ishizuka 2003). Turney (2000) also points out a dozen useful applications of key phrase extraction, for example highlighting key phrases in text, document classification, text compression, and constructing human-readable text. Most information stored in databases consists of textual documents. Extracting key phrases makes it possible to determine which documents are important and to identify the relations among several documents, because the relevant key phrases are extracted (Matsuo and Ishizuka 2003; Hammouda et al. 2005). According to Voorhees (1999), the majority of systems use statistical approaches for information retrieval (key phrase extraction), based on the assumption that two texts on the same topic use the same key phrases. A statistical approach measures the similarity of key phrases between textual documents. There are different approaches to key phrase extraction, and the most commonly used are the lexical approach, natural language processing (NLP), and the term frequency approach (Trappey et al. 2008). Hammouda et al. (2005) divide key phrase extraction algorithms into two categories. Key phrase extraction applied to single documents requires supervised learning, whereas key phrase extraction applied to a set of documents is unsupervised and self-learning: it discovers rather than learns from examples and is also known as knowledge discovery. Research indicates that the main goal of key phrases is to represent the topics discussed in a text document (Turney 2000). Furthermore, Turney (2000) points out that key phrase extraction enables users to determine quickly whether a document is in their field of interest and that the key phrases can be used for relevance indexing.
Key phrase extraction has been applied in many different fields, although mainly for summarization purposes (Turney 2000). For example, Nenkova et al. (2006) studied the impact of automatic summarization systems based on key phrase extraction and its role in human summarization; the results showed that the key phrase frequency methodology generated summaries comparable with those of state-of-the-art systems. Trappey et al. (2008) use a hierarchy and semantic relationship concept to create a summarization system that uses key phrases to summarize any patent document based on its specific domain.

2.6.1. Term frequency approach

The term frequency (TF) approach is based on the assumption that highly frequent key phrases in a text document are more relevant to the concept of the content (Trappey et al. 2008). Robertson (2004) also points out that a term with a high frequency represents a document better. Furthermore, in the information retrieval of terms (key phrases), the most common terms are used in weighting schemes to represent text documents (Aizawa 2002). For example, Robertson and Sparck Jones (1976) study the relevance of weighting methods for key phrases using term frequency weighted with the inverse document frequency (TF-IDF). Trappey et al. (2008) use a normalized TF-IDF to extract key phrases for the clustering of patents.

The concept of TF-IDF is that it weights frequent key terms across a series of documents to determine their relevance. Frequent key terms in one document cannot represent a domain, but frequent key terms in a series of documents might represent the concept of the domain (Robertson and Sparck Jones, 1976). The basic formula of IDF used by Robertson and Sparck Jones (1976) and Trappey et al. (2007) is expressed as:

$$idf_j = \log_2\left(\frac{n}{df_j}\right) \qquad (1)$$

where $n$ is the total number of documents in the collection and $df_j$ is the number of documents in the collection that contain term $j$. $idf_j$ itself represents the inverse document frequency (IDF) of term $j$.
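The weighting scheme of this subsection can be tried out on a toy collection. The sketch below computes the inverse document frequency described above, the TF-IDF weight obtained by multiplying it with the term frequency, a length-normalized term frequency, and the cosine (inner product) correlation between two key phrases that Section 2.6.2 builds on these weights. Several points are assumptions rather than statements from the cited papers: the logarithm is taken to base 2, the normalization is read as term frequency multiplied by the average document length and divided by the length of the document, and the three "patents", their key phrases, and all function names are hypothetical.

```python
import math

# Hypothetical collection: each patent is reduced to a list of extracted key phrases.
patents = {
    "P1": ["implant", "screw", "jawbone", "screw"],
    "P2": ["implant", "denture"],
    "P3": ["implant", "screw", "jawbone"],
}


def idf(term, docs):
    """Inverse document frequency: log2(number of documents / documents containing term)."""
    df = sum(1 for words in docs.values() if term in words)
    return math.log2(len(docs) / df) if df else 0.0


def tf_idf(term, doc_id, docs):
    """Weight of `term` in one document: term frequency times IDF."""
    return docs[doc_id].count(term) * idf(term, docs)


def ntf(term, doc_id, docs):
    """Normalized term frequency (assumed form): tf * average word number / word number."""
    average_wn = sum(len(words) for words in docs.values()) / len(docs)
    return docs[doc_id].count(term) * average_wn / len(docs[doc_id])


def correlation(term_a, term_b, docs):
    """Cosine of the angle between the NTF vectors of two key phrases."""
    a = [ntf(term_a, d, docs) for d in docs]
    b = [ntf(term_b, d, docs) for d in docs]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


for term in ("implant", "screw", "denture"):
    print(term, round(idf(term, patents), 3), round(tf_idf(term, "P1", patents), 3))
print(round(correlation("screw", "jawbone", patents), 3))
```

A phrase that occurs in every document, such as "implant" here, receives an IDF of zero and therefore cannot distinguish one patent from another, whereas two phrases that tend to occur in the same patents obtain a correlation close to one.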
Trappey et al. (2007) describe $idf_j$ as a value that expresses how well term $j$ represents a document: if $idf_j$ becomes significantly high, the term can represent a specific document. According to Trappey et al. (2007), the weighting of a key phrase in a text document using TF-IDF, where TF is weighted by IDF, is expressed as:

$$w_{jk} = tf_{jk} \times idf_j \qquad (2)$$

where $w_{jk}$ is defined as the weight of term $j$ in document $k$ of the collection, $tf_{jk}$ is the number of times term $j$ occurs in document $k$ of the collection, and $idf_j$ is the inverse document frequency of term $j$. Therefore, the term with the highest value of $w_{jk}$ is the most significant key phrase in a specific text document and is identified as the key phrase for document $k$. Furthermore, Trappey et al. (2008) normalize TF-IDF because TF-IDF does not consider the difference in the number of words in each document; Trappey et al. (2010) therefore normalize the frequency weights of key phrases by the number of words in each document. According to Trappey et al. (2010), the normalized TF-IDF (NTF) can be expressed as follows:

$$NTF_{jk} = tf_{jk} \times \frac{\frac{1}{n}\sum_{s=1}^{n} WN_s}{WN_k} \qquad (3)$$

where $tf_{jk}$ is the number of times term $j$ occurs in document $k$ of the collection, $WN_k$ is the word number of document $k$, and $n$ is the total number of documents in the document collection.

2.6.2. Key phrase correlation matrix

The key phrase correlation matrix contains the correlation of the important key phrases (KP) in each patent document, which is used to understand the logical link between concepts and methodologies (Trappey et al. 2010). Trappey et al. (2010) describe the methodology of using TF-IDF and NTF to calculate the correlation between key phrases and to create a key phrase correlation matrix using the inner product of vectors, expressed as:

$$R_{i,j} = \frac{KP_i \cdot KP_j}{\lVert KP_i \rVert \, \lVert KP_j \rVert} = \frac{\sum_{k=1}^{n} NTF_{ik}\, NTF_{jk}}{\sqrt{\sum_{k=1}^{n} NTF_{ik}^{2} \, \sum_{k=1}^{n} NTF_{jk}^{2}}} \qquad (4)$$

where $KP_i = (NTF_{i1}, NTF_{i2}, \dots, NTF_{in})$ and $\frac{1}{n}\sum_{s=1}^{n} WN_s$ is the average Word Number (WN). Trappey et al. (2010) use an algorithm with four stages. First, the algorithm transforms the patent document into a key phrase vector and analyzes the frequency of the key phrases. Second, the key phrase vector is derived by eliminating unnecessary key phrases.
Third, the correlation values between key phrases are calculated using Equation (4). Fourth, the correlation coefficients are derived from the number of different key phrases occurring in each patent document. The key phrase correlation matrix is shown in Table 4.

Table 4. Key phrases correlation matrix

|  | KP1 | KP2 | KP3 | … | KPn |
|---|---|---|---|---|---|
| KP1 | R1,1 | R1,2 | R1,3 | … | … |
| KP2 | R2,1 | R2,2 | … | … | … |
| KP3 | R3,1 | … | … | … | … |
| … | … | … | … | … | … |
| KPm | … | … | … | … | … |

Source: Trappey et al. (2010)

The key phrase correlation matrix is used as the input for patent technology clustering. It represents the technology in each patent document and thus provides the internal relationship among patent documents, instead of clustering patents according to classification codes such as UPC or IPC.

2.6.3. Key phrase and patent correlation matrix

In the key phrase and patent correlation matrix, the frequency (Fnm) of each key phrase (KP) appearing in each patent document is calculated, as well as NTF, Rate (%), and NTFR. The Rate describes the percentage of the patents Patent1 to Patentn in which KPm occurs. NTFR is the product of NTF and Rate and expresses the relevance of KPm within the patent collection, as shown in Equation (5). The key phrase KPm is a representative phrase of the patent Patentn if its frequency Fnm is large enough across Patent1 to Patentn (Trappey et al. 2010). The key phrase and patent correlation matrix is shown in Table 5.

$$NTFR_m = NTF_m \times Rate_m, \qquad Rate_m = \frac{\sum_{k=1}^{n} X_{km}}{n} \times 100\% \qquad (5)$$

where the sum runs over the patents Patent1 to Patentn, and for the frequency Fnm of KPm in Patentn the indicator Xnm is defined as Xnm = 0 if Fnm = 0 and Xnm = 1 if Fnm is greater than 0.

Table 5. Key phrases and patent correlation matrix

|  | Patent1 | Patent2 | Patent3 | … | Patentn | NTF | Rate (%) | NTFR |
|---|---|---|---|---|---|---|---|---|
| KP1 | F1,1 | F1,2 | F1,3 | … | … | … | … | … |
| KP2 | F2,1 | F2,2 | … | … | … | … | … | … |
| KP3 | F3,1 | … | … | … | … | … | … | … |
| … | … | … | … | … | … | … | … | … |
| KPm | … | … | … | … | Fnm | … | … | … |

Source: Trappey et al. (2010)

2.6.4. Limitations in key phrases extraction of textual documents

Nasukawa and Nagano (2001) mention some issues with using key phrases to represent a textual document.
The problem is that textual documents are unclear because natural language is ambiguous and the same key phrase may have different meanings in the same textual document (Nasukawa and Nagano 2001). For example, the word "watch" can mean a timepiece or can mean to look, to observe, or to pay attention. Different words can also have the same meaning, for example "laptop" and "notebook" or "cellular phone" and "mobile phone".

2.7. Technology life cycle analysis

Life cycle analysis, as the name implies, is a straightforward methodology that assesses all impacts of a product or service, from the initial extraction of raw material to the final output or disposal of the product (Ayres RU 1995). When companies invest R&D capital in technologies, the decision often depends on the current life cycle stage of the technology (Haupt et al. 2007). According to Haupt et al. (2007) and Ernst (1997), patent documents inform us about technical development and the life cycle stage of an industry because they contain core technology information. The patentability of a technology is also one of the preconditions for its commercial potential. In addition, patent documents contain the patent application date, which informs us about the life cycle of different products based on the technology before they are commercialized (Haupt et al. 2007). The concept of the technology life cycle is similar to that of the product life cycle and can be divided into four stages: introduction, growth, maturity, and decline or saturation (Haupt et al. 2007; Trappey et al. 2010). Haupt et al. (2007) also point out that, regardless of the reference factor used for the technology life cycle and of the fact that patent-based life cycles start earlier than product- or sales-based ones, the principles of the product life cycle can still be applied to the technology life cycle.
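Patent-based life cycles of this kind are usually drawn as an S-shaped curve of cumulative patent counts. The sketch below assumes a logistic function as the S-shaped model, which is one common choice and not necessarily the model fitted in the cited studies, and assigns a life cycle stage from the share of the saturation level that has been reached. The parameter values and the cut-offs of 10 %, 50 %, and 90 % are hypothetical and serve only to show the mechanics.

```python
import math


def logistic(t, saturation, growth_rate, midpoint):
    """Cumulative patent count at time t under a logistic (S-shaped) model."""
    return saturation / (1.0 + math.exp(-growth_rate * (t - midpoint)))


def stage(cumulative, saturation):
    """Map a cumulative count to a life cycle stage (cut-offs are assumptions)."""
    share = cumulative / saturation
    if share < 0.10:
        return "introduction"
    if share < 0.50:
        return "growth"
    if share < 0.90:
        return "maturity"
    return "decline or saturation"


# Hypothetical technology: 1000 patents at saturation, inflection point in 2000.
SATURATION, GROWTH_RATE, MIDPOINT = 1000.0, 0.5, 2000

for year in (1990, 1998, 2003, 2010):
    count = logistic(year, SATURATION, GROWTH_RATE, MIDPOINT)
    print(year, round(count), stage(count, SATURATION))
```

With real data, the three parameters would be estimated from the yearly cumulative patent counts of a technology cluster, and the position of the current year on the fitted curve would indicate the stage.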
Several studies of the technology life cycle based on patent document information show that an S-shaped curve can represent the technology life cycle. The S-shaped curve includes the four stages: introduction, growth, maturity and decline (Haupt et al. 2007). Andersen (1999) studied the S-curve with examples from the pharmaceutical industry. Trappey et al. (2010) studied RFID technology in China and forecasted potential market and R&D opportunities. Another study by Trappey & Wu (2008) used the S-curve analysis technique to evaluate products with short life cycles, such as electronics.

The beginning of the life cycle of a new technology, the introduction stage, involves working on fundamental scientific problems. These technical problems have to be solved before rapid technological progress is possible, and during this period radical innovations are awaited. At this stage, the number of patent applications is low but slowly increasing, because there is a lot of uncertainty and only pioneer firms are willing to take the R&D risk (Haupt et al. 2007; Trappey et al. 2010; Trappey & Wu 2008). During this stage the number of patent applications per applicant is relatively high compared with other stages of the life cycle. This is because of the problems of new innovative technologies, because the cost is still too high for customer acceptance, or because standardization of the product has not yet evolved.

In the growth stage, the fundamental technical problems have been solved and the market uncertainty has “vanished”. Many products are developed based on the technology and the R&D risk decreases, resulting in an increase in patent applications (Haupt et al. 2007; Trappey et al. 2010). The growing number of patent applications also decreases the number of patent applications per applicant, due to new competitors. The technology enters the mature stage when the number of patent applications is constant and no new features are being developed for the technology.
Thereafter the technology enters the decline or saturation stage. Patent activity is an important indicator of the current technology life cycle stage. Furthermore, Haupt et al. (2007) and Ernst (1997) have applied this S-curve methodology to niche technologies such as pacemaker technology. Ernst (1997) proposed that the cumulative patent applications per year for a specific technology over a certain period of time can be plotted as an S-curve, from which the different technology life cycle stages can be analyzed. An example of the principles of the S-curve is shown in Figure 6.

[Figure 6: S-curve divided into the introductory, growth, maturity and decline stages. X-axis: time; Y-axis: accumulated patents per year.]

Figure 6. S-curve of TLC stages. X-axis represents a period of time and Y-axis represents accumulated patents over the time period (Adapted from Trappey et al. 2010).

Technology life cycle analysis is important for companies when evaluating the timing of R&D or other investment opportunities in technologies, and it is strategically important to take the life cycle stage into account. For example, at the introduction stage, companies should aggressively apply for patent families around their core invention (patent) to strengthen their position in the marketplace. If, however, the technology is in the growth phase, it is important for companies to search for core technologies in that field and develop their own applications. At the mature stage it is important to evaluate the boundaries of patents to avoid infringement issues, or to create alliances to trade IP. Finally, the declining stage implies that a new technology is replacing the old one, which is the beginning of a new technology life cycle (Trappey et al. 2010; Trappey and Wu 2008; Haupt et al. 2007; Ernst 1997).

2.8. Research framework

This research proposes a domain-specific patent ontology methodology for technology clustering and life-span analysis. The methodology steps are based on previous research by Trappey et al.
(2010), as shown in Table 6. Trappey et al. (2010) referenced a modified ontology to extract key phrases from patent documents. The first step is to define a patent domain and select the IPCs and UPCs in this domain. The second step is to collect domain-specific patent documents from the USPTO database. These steps are completed using the data mining software Intellectual Property Defense-based Support System (IPDSS) (Wheeljet, 2011).

Table 6. Methodology outline of this research
| Method by Trappey et al. (2010) | Method in this research |
| 1. Data preprocessing | 1. Define domain (IPC & UPC) |
| 2. Key phrase analysis (TF-IDF) | 2. Data preprocessing |
| 3. Key phrase correlation measure | 3. Key phrase analysis (NTFR) |
| 4. Patent technology clustering | 4. Process ontology |
| 5. Patent document clustering | 5. Ontological sub-domain technology clustering |
| 6. Lifecycle analysis | 6. Life-span analysis of clusters |

3. METHODOLOGY

This section describes the methodology developed in this research and is divided into four parts: patent domain definition, key phrase analysis process, processing of the domain-specific ontology, and technology life-span analysis of patent clusters. The IPDSS (Wheeljet, 2011) is used as a tool for data mining, key phrase extraction and clustering. Figures 7 and 8 show the IPDSS software used in this research.

Figure 7. Login page of IPDSS

Figure 8. Workspace of IPDSS

3.1. Patent domain definition

The first step is to define a specific patent domain and select relevant UPCs or IPCs, as shown in Figure 9. As described in the literature review, patents under the same classification code may be entirely different in technology (Taghaboni-Dutta et al. 2010).
[Figure 9: flowchart of patent domain definition. Stage 1 (USPTO website): study relevant UPCs, patent search, select relevant UPCs, study UPC patents, delete or add UPCs, defined patent domain. Stage 2 (IPDSS): patent document collection from the patent database (USPTO), study patent figures and abstracts, define technology-specific patents and separate out other patents. The domain-specific patents on IPDSS are passed to the key phrase analysis process.]

Figure 9. Processing of patent domain definition

Stage 1 is to select the patent domain based on patent classifications, following these steps:
1. Use the USPTO or WIPO website and study the UPC or IPC definitions for the chosen domain. The patent classifications UPC and IPC are described at USPTO and WIPO, respectively (USPTO 2011; WIPO 2011).
2. Select relevant IPCs or UPCs. (Note: Use either IPC or UPC, not both).
3. Include between 5 and 15 patent classifications.
4. Search for 3-4 patents for each individual IPC or UPC on USPTO or WIPO.
5. Study those patents to determine whether the chosen IPCs or UPCs (from step 2) are relevant for the chosen domain. Delete or add patent classifications to your domain.
6. Patent domain defined.

Stage 2 is to collect domain-specific training patent documents according to the UPCs or IPCs, and thereafter to define the technology-specific patents and exclude the others, following these steps:
1. Use the Intellectual Property Defense-based Support System (IPDSS) to download 150 training patents from USPTO if UPC is chosen (WIPO if IPC is chosen) (Note: The patents have to match the chosen patent classifications).
2. Study patent figures and abstracts to define technology-specific patents according to the chosen IPCs or UPCs. Delete other patents. (Note: Patents under the same classification code might represent different technology).
3. Key phrase analysis process of domain-specific patents

In this research, a dimension represents a domain; for example, the dental implant domain can also include the dimensions dental implant tools and dental implant materials.
One dimension includes several important classification codes to represent key concepts and technology. However, the classifications must be limited so that they remain technology specific. The choice of IPC or UPC depends on the specific domain and technology. The IPDSS software has a function to connect to the USPTO database (or WIPO) to perform patent searches and download patents to the IPDSS database. IPDSS also automatically preprocesses all patent documents into a standard format, which means that spaces between words and phrases are removed so that the frequency count of words and phrases in each patent document can be performed automatically. For each dimension, key phrases are separately extracted from dental patent documents to build a domain-specific ontology of dental implants.

3.2. Key phrase analysis process

The key phrase analysis process generates a list of frequent and important phrases from each patent document. These phrases are used to form logical links between concepts. In this research, the key phrase analysis, the key phrase correlation matrix, and the key phrase and patent correlation matrix are derived using IPDSS, which applies the normalized term frequency – inverse document frequency (NTF) methodology, as shown in Figure 10. The following sections discuss NTF, the key phrase correlation measure, and the key phrase and patent correlation matrix.

[Figure 10: flowchart of the key phrase analysis process in IPDSS: raw patent document input (defined domain-specific patents), data preprocessing, NTF calculation, key phrase vector output, key phrase correlation measure, key phrase and patent correlation matrix, output.]

Figure 10. The key phrase analysis process

Normalized term frequency – inverse document frequency (NTF)

After IPDSS performs data preprocessing, the weight of each term is calculated by IPDSS using normalized term frequency – inverse document frequency (Trappey et al. 2008).
As described in the literature review, TF-IDF is a weighting method that weights frequent terms in a series of documents to represent text documents (Aizawa 2002). However, TF-IDF does not consider the difference in the number of words in each document (Trappey et al. 2010). Therefore, the frequency weights of key phrases are normalized by the number of words in each document. Robertson (2004) points out that highly frequent terms represent a document better. The normalized TF-IDF (NTF) can be expressed as follows:

\[ NTF_{ik} = tf_{ik} \times \frac{\left(\sum_{s=1}^{n} wn_s\right)/n}{wn_k} \]

where
tf_{ik} = the number of times term i occurs in document k of the collection
wn_k = the number of words in document k
n = the total number of documents in the document collection.

The frequency (Fnm), Rate, NTF, and NTFR are calculated and tabulated in Table 7, which is the output of the key phrase analysis.

Table 7. Key phrases and patent correlation matrix
|     | Patent1 | Patent2 | Patent3 | … | Patentn | NTF | Rate (%) | NTFR |
| KP1 | F1,1 | F1,2 | F1,3 | … | … | … | … | … |
| KP2 | F2,1 | F2,2 | … | … | … | … | … | … |
| KP3 | F3,1 | … | … | … | … | … | … | … |
| …   | … | … | … | … | … | … | … | … |
| KPm | … | … | … | … | Fnm | … | … | … |

The formulas are as follows:

\[ F_{nm} = \sum_{i \in KP_m} KPF_i \]

where
F_{nm} = the number of key phrases (belonging to KPm) that are included in Patentn
KPF_i = the frequency of key phrase i (belonging to KPm) in Patentn.

The NTFR-value is expressed as

\[ NTFR_m = NTF_m \times \frac{\sum_{i=1}^{n} X_{im}}{n} \times 100\% \]

where X_{nm} = 0 if F_{nm} = 0, and X_{nm} = 1 if F_{nm} > 0.

Key phrase correlation measure

IPDSS calculates the correlation values between key phrases to create a key phrase correlation matrix, using the inner product of the key phrase vectors:

\[ Correlation(KP_i, KP_j) = \frac{KP_i \cdot KP_j}{\lVert KP_i \rVert \, \lVert KP_j \rVert} = \frac{\sum_{k=1}^{n} w_{ik} w_{jk}}{\sqrt{\sum_{k=1}^{n} w_{ik}^2 \; \sum_{k=1}^{n} w_{jk}^2}} \]

where
KP_i = (w_{i1}, w_{i2}, …, w_{in}) = the vector of key phrase i
KP_j = (w_{j1}, w_{j2}, …, w_{jn}) = the vector of key phrase j

\[ w_{ik} = NTF_{ik} \times idf_i, \qquad NTF_{ik} = tf_{ik} \times \frac{\text{average WN}}{wn_k}, \qquad \text{average WN} = \frac{\sum_{s=1}^{n} wn_s}{n}, \qquad idf_i = \log\frac{n}{df_i} \]

average WN = the average Word Number (WN) of the documents
tf_{ik} = the number of times term i occurs in document k of the patent collection
df_i = the number of documents in the collection which contain term i
n = the total number of documents in the collection

First, the algorithm transforms the patent document into a key phrase vector and analyzes the frequency of key phrases.
Second, the key phrase vector is derived by eliminating unnecessary key phrases. Third, the correlation values between key phrases are calculated using Equation (11). Fourth, the correlation coefficients are derived from the number of different key phrases occurring in each patent document. The correlation coefficient is calculated according to the formula below:

\[ R_{ij} = \frac{1}{n} \sum_{k=1}^{n} r_{ij}(k) \]

where
r_{ij}(k) = the correlation value of key phrase i and key phrase j in document k
n = the total number of documents in the patent collection

After the correlation coefficients are calculated, they can be shown as the key phrase correlation matrix in Table 8. The frequency is calculated for all terms, and KPfi denotes the frequency of key phrase KPi in the document. RPfij denotes the frequency of the related phrase RPij. The correlation of RPij and KPi is denoted Rij. The final frequency of KPi can be calculated as follows:

\[ KPF_i = KPf_i + \sum_{j} \left( RPf_{ij} \times R_{ij} \right) \]

After KPF has been calculated for all key phrases, the values are collected in a vector:

[KPF1, KPF2, …, KPFn] (13)

This vector is used as the input for patent technology clustering and patent document clustering.

Table 8. Key phrases correlation matrix
|     | KP1  | KP2  | KP3  | … | KPn |
| KP1 | R1,1 | R1,2 | R1,3 | … | … |
| KP2 | R2,1 | R2,2 | … | … | … |
| …   | … | … | … | … | … |
| KPm | … | … | … | … | Rmn |
Source: Trappey et al. (2010)

3.4. Processing domain-specific ontologies

The domain-specific ontology is built using Microsoft Visio 2007 (MS Visio) as a visualization tool for transferring the domain-specific ontological schema (Huang et al. 2008). An ontology can be visualized as a pyramid whose top represents the domain concept. In this research, the ontology structure is based on the RFID ontology tree in Figure 4, and the key phrases with the top 50 NTFR-values in the key phrase matrix (from the key phrase analysis process) are chosen to build the ontology. The processing of the ontology is described by the following steps, and an overview of the processing is shown in Figure 11.
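To illustrate how the top NTFR-valued key phrases could be selected, the following minimal Python sketch computes NTF, Rate and NTFR from a small frequency matrix. It is not the IPDSS implementation. The function names, the input layout, and the choice to sum the normalized frequencies over all patents to obtain a single NTF value per key phrase are assumptions made for this example; Rate is expressed as a fraction rather than a percentage, which does not change the ranking.

```python
from typing import Dict, List


def ntfr_scores(freq: Dict[str, List[int]], word_counts: List[int]) -> Dict[str, float]:
    """Return an NTFR score per key phrase.

    freq[phrase][k] is the count of the phrase in patent k (hypothetical layout);
    word_counts[k] is the total number of words in patent k.
    """
    n = len(word_counts)
    avg_wn = sum(word_counts) / n  # average word number of the collection
    scores = {}
    for phrase, counts in freq.items():
        # Assumption: the phrase-level NTF is the sum of the per-patent
        # normalized term frequencies tf * (average WN / wn_k).
        ntf = sum(tf * avg_wn / wn for tf, wn in zip(counts, word_counts))
        # Rate: share of patents in which the phrase occurs (X = 1 if F > 0).
        rate = sum(1 for tf in counts if tf > 0) / n
        scores[phrase] = ntf * rate
    return scores


def top_key_phrases(scores: Dict[str, float], k: int = 50) -> List[str]:
    """Return the k key phrases with the highest NTFR score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    freq = {"implant": [2, 0], "abutment": [1, 1]}
    word_counts = [100, 300]
    print(top_key_phrases(ntfr_scores(freq, word_counts), k=1))
```

In this toy case the phrase that occurs in every patent is ranked above a phrase that is frequent in only one patent, which is the effect of multiplying NTF by Rate.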
Step 1: Organize patents with the same patent classification codes

This step uses the key phrase matrix output from the key phrase analysis process. First, patents with the same classification are grouped together, for example Patent1, Patent6-10 and Patent23-28 if they share a patent classification. Second, the frequency (Fnm) of each key phrase (KPn) in the key phrase matrix is analyzed to determine which KPn is expressed in which UPC. This gives an overview of which classifications use the same key phrases, and it can be tabulated as shown in Table 9.

Step 2: Map classification codes – TYPE I ontology

From the previous step, a new matrix is constructed, as shown in Table 10, to visualize which key phrase is expressed in each UPC. The Type I ontology is constructed using MS Visio, as shown in Figure 11, and the key phrases are laid out to visualize common phrases among patents. Key phrases that are expressed in the same number of UPCs (for example in UPC1, UPC2 and UPC3) share one color from a user-defined color scheme, while key phrases expressed in UPC1 and UPC2 only receive a different color. The color scheme helps R&D engineers to visualize the common phrases in the domain.
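Steps 1 and 2 can be sketched in a few lines of Python. This is an illustrative sketch only, not part of IPDSS or MS Visio; the function names and the dictionary layout of the key phrase matrix are assumptions. The first function determines in which classification codes each key phrase is expressed, and the second groups key phrases by the number of codes they appear in, which is the grouping that the color scheme is meant to visualize.

```python
from collections import defaultdict
from typing import Dict, List


def key_phrase_to_upcs(
    freq: Dict[str, Dict[str, int]], patent_upc: Dict[str, str]
) -> Dict[str, List[str]]:
    """Map each key phrase to the sorted list of UPCs in which it occurs.

    freq[phrase][patent_id] is the frequency of the phrase in that patent;
    patent_upc[patent_id] is the classification code of the patent.
    """
    mapping = {phrase: set() for phrase in freq}
    for phrase, per_patent in freq.items():
        for patent_id, count in per_patent.items():
            if count > 0:  # the phrase is "expressed" in this patent's UPC
                mapping[phrase].add(patent_upc[patent_id])
    return {phrase: sorted(upcs) for phrase, upcs in mapping.items()}


def group_by_upc_count(mapping: Dict[str, List[str]]) -> Dict[int, List[str]]:
    """Group key phrases by the number of UPCs in which they are expressed."""
    groups = defaultdict(list)
    for phrase, upcs in mapping.items():
        groups[len(upcs)].append(phrase)
    return dict(groups)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    patent_upc = {"P1": "UPC1", "P2": "UPC1", "P3": "UPC2"}
    freq = {"KP1": {"P1": 3, "P3": 1}, "KP2": {"P2": 2, "P3": 0}}
    mapping = key_phrase_to_upcs(freq, patent_upc)
    print(mapping)
    print(group_by_upc_count(mapping))
```

Each group returned by `group_by_upc_count` would receive one color in the Type I ontology drawing.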
[Figure 11: flowchart of ontology processing. Patent domain definition and the key phrase analysis process in IPDSS produce the key phrase matrix. Step 1 organizes patent classifications and Step 2 maps classification codes (Type I ontology). Step 3 organizes key phrases and Step 4 pre-defines ontological sub-domains (Type II ontology). The improvement steps (new domain-specific patents, key phrase analysis process, patent technology clustering, patent document clustering, improvement and definition of ontological sub-domains) lead to the Type III ontology. The ontologies are drawn in Microsoft Visio.]

Figure 11. Processing of ontology

Step 3: Organize key phrases

The key phrases from Table 10 are organized and grouped together according to their relationships and logical links, as shown in Table 11. The online dictionary WordNet (Princeton University, 2011) and the patent documents themselves are used to understand the meaning of the key phrases. This procedure is based on personal experience and interpretation of words. The goal is to create 4-5 large groups of key phrases. For example, in Table 11, KP3, KP2, and KP6 form one group. The first draft of the ontology is then improved using the grouped key phrases to provide better concept relationships.

Table 9. Patent and UPC matrix
|      | Patent1 | Patent2 | Patent3 | Patent4 | … | Patentn | NTF | Rate (%) | NTFR |
| UPC  | UPC1 | UPC1 | UPC1 | UPC2 | … | UPCz |  |  |  |
| KP1  | F1,1 | F1,2 | F1,3 | … | … | … | … | … | … |
| KP2  | F2,1 | F2,2 | … | … | … | … | … | … | … |
| KP3  | F3,1 | … | … | … | … | … | … | … | … |
| …    | … | … | … | … | … | … | … | … | … |
| KP50 | … | … | … | … | … | Fn50 | … | … | … |
Note: UPCz = the UPC code of each patent

Table 10. Key phrase and UPC matrix
| KP1  | UPC1 | UPC2 | UPC3 | … | … |
| KP2  | UPC1 | UPC2 | … | … | … |
| KP3  | UPC1 | … | … | … | … |
| …    | … | … | … | … | … |
| KP50 | … | … | … | … | UPCn |
Note: Each column has the same classification code (IPC or UPC)

Table 11. Key phrase organization matrix
| KP3  | UPC1 | … | … | … | … |
| KP2  | UPC1 | UPC2 | … | … | … |
| KP6  | UPC1 | UPC2 | UPC3 | … | … |
| …    | … | … | … | … | … |
| KP50 | … | … | … | … | UPCn |
Note: Each column has the same classification code (IPC or UPC)

Step 4: Pre-define ontological sub-domain – TYPE II ontology

The final step is to pre-define the key phrase groups in Table 11 by studying the phrases, the patent classification definitions, and the patent technology, in order to assign an appropriate definition to each ontological sub-domain. The processing of these steps is shown in Figure 12.

[Figure 12: the Type I ontology is reorganized in MS Visio into the Type II ontology, with the patent domain at the center, using WordNet, patent classifications, patent documents, and the researcher's own logic and experience.]

Figure 12. Processing TYPE II ontologies

The following steps describe the construction of the Type II ontology using MS Visio:
1. Use MS Visio to group the key phrases of the Type I ontology according to the groups from Step 3.
2. Start with the patent domain at the center in MS Visio.
3. Use WordNet, patent classifications, patent documents, and your own logic to determine which key phrases are most strongly associated with the patent domain.
   a. Start linking key phrases from the center (patent domain) outwards using MS Visio. An ontology is hierarchical, so use your logic and link the phrases of the pre-defined sub-domains. (Note: Linking key phrases depends on the patent domain, the technology, and the researcher's understanding, which requires study of the patent documents and the domain).
   b. Use a color scheme to color the key phrases of the pre-defined ontological sub-domains using MS Visio.
Improvement of domain-specific ontologies

This stage improves the Type II ontology, as shown in Figure 13. The following steps describe the process of creating the Type III ontology.

Step 1: New domain-specific patents

Collect 50 new domain-specific patent documents from USPTO based on the patent domain definition, and use IPDSS to preprocess the data and carry out the key phrase analysis process, patent technology clustering, and patent document clustering.

Step 2: Patent technology clustering

The correlation matrix derived from the key phrase correlation analysis in IPDSS is used as the input for technology clustering. The purpose of patent technology clustering is to discover the relationships between patents. Since the key phrases represent the concept or technology of each patent document, the key phrase correlation matrix and the extracted key phrases are central to the clustering. Patent technology clusters are generated by applying the K-means algorithm to the key phrase correlation matrix. This technique can help researchers to select technology clusters to analyze; in this research, however, it is used as an input for patent document clustering, which is described in the next step.

Step 3: Patent document clustering

The vector output of patent technology clustering is used as the input for patent document clustering. A matrix is constructed as the input for patent document clustering, as shown in Table 12. Patents under the same classification code can be entirely different, and patent document clustering derives the internal relationships based on technologies. The output of this method is that similar patents are clustered together to create homogeneous clusters. It is therefore important and useful for researchers who want to group technologies for analysis. Patents are clustered according to the formula below.
\[ N_{nm} = \sum_{i \in TC_m} KPF_i \]

where
N_{nm} = the number of key phrases (belonging to TCm) that are included in Patentn
KPF_i = the frequency of key phrase i (belonging to TCm) in Patentn

[Figure 13: flowchart of the improvement steps. New domain-specific patents from the patent database (USPTO) are processed in IPDSS through the key phrase analysis process, patent technology clustering and patent document clustering (Steps 1 to 3). The key phrase matrix of each cluster is extracted (Step 4). In Step 5 the Type II ontology structure is modified in MS Visio (add new key phrases, work from the center outwards, apply a color scheme, define ontological sub-domains) to obtain the Type III ontology.]

Figure 13. Improvement steps of domain-specific ontologies

Step 4: Key phrase extraction of document clusters

Patent document clustering creates clusters of the new patent documents according to Step 3. Each patent document cluster is separately subjected to the key phrase analysis process using IPDSS, which applies NTFR, and the 15 key phrases with the highest NTFR-values are extracted and tabulated as in Table 13. These key phrases are used to improve the pre-defined ontological sub-domain phrases.

Table 12. Key phrases and patent correlation matrix
|     | Patent1 | Patent2 | Patent3 | … | Patentn |
| TC1 | N1,1 | N1,2 | N1,3 | … | N1n |
| TC2 | N2,1 | N2,2 | … | … | … |
| …   | … | … | … | … | … |
| TCm | … | … | … | … | Nnm |

Table 13. Key phrases of each cluster
| Cluster 1 | Cluster 2 | Cluster 3 | … | Clustern |
| KP1,1 | KP1,2 | KP1,3 | … | KP1n |
| KP2,1 | KP2,2 | … | … | … |
| … | … | … | … | … |
| … | … | … | … | KPn |

Step 5: Modification and define ontology sub-domain – TYPE III ontology

The key phrase collection of each individual cluster in Table 13 is compared with the key phrases in the Type II ontology. The following steps describe the procedure.
1. Use MS Visio to add the key phrases from Table 13 to the Type II ontology.
   a. Note: Do not link phrases yet; group the phrases according to their clusters.
2. Compare each cluster of key phrases from Table 13 with the Type II ontological sub-domains.
   a. For example, the key phrases of cluster 1 in Table 13 are compared one by one with the key phrases of the Type II ontology. The more matching phrases, the better the cluster represents that sub-domain.
   b. Assign the best matching cluster to each Type II ontological sub-domain.
3. Use WordNet, patent classifications, patent documents and your own logic to understand and determine whether the clusters from Table 13 are relevantly grouped with the Type II ontology.
   a. Use MS Visio to rearrange the key phrases of the Type II ontology and modify the structure of the pre-defined sub-domains (if it makes sense) in the Type II ontology.
   b. Use MS Visio to start linking new key phrases from each assigned cluster into the sub-domains of the Type II ontology. Work from the center outwards. Try to create sub-domains and, from each sub-domain, a hierarchical tree structure.
   Comments: The pre-defined sub-domains from the Type II ontology can be deleted and new sub-domain definitions created. New key phrases from Table 13 are also added and linked, together with the key phrases shared with other sub-domains. Shared key phrases are usually the most common key phrases in a patent domain, which can help engineers to understand the domain concept better.
   c.
Type II ontology structure is modified and colored with a color scheme according to the sub-domain definition (next step).
4. Define the ontological sub-domains, which depend on the previous step. Depending on the case, each sub-domain can for example describe one of the main components of a technology, which can be separated into several major parts.

3.5. Processing life-span analysis in patent clusters

In this research, the procedure of technology clustering and life-span analysis is carried out through a case study. The ontology is used to cluster patents according to its sub-domain concepts by assigning each patent individually to an ontological sub-domain cluster, as shown in Figure 14. The key phrases from each patent are compared with each sub-domain cluster of the ontology, which contains the key phrases considered to be key concepts.

Step 1: Test patents

Thirty new domain test patents are downloaded from USPTO and processed in IPDSS. These test patents are new and have not been used to train the system or to build or improve the ontology. The patent classifications of these patents are not restricted.

Step 2: Key phrase analysis process

IPDSS applies the NTFR methodology to analyze the key phrases of the patent documents, and the output is the key phrase matrix. The key phrase matrix lists the frequencies of all key phrases for each patent.

Step 3: Ontological sub-domain clustering

The list of key phrases for each patent is compared, key phrase by key phrase, with the sub-domain key phrases of the Type III ontology. A patent is assigned to the sub-domain whose concepts and relationships its key phrases describe most consistently. The patents assigned to each sub-domain are clustered together, and this is called ontological sub-domain clustering. The clusters are named after the sub-domain definition.
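The comparison in Step 3 can be approximated in code. In this research the assignment relies on the researcher's judgment of which sub-domain a patent describes most consistently; the Python sketch below replaces that judgment with a simple stand-in, the number of key phrases a patent shares with each sub-domain. The function names and the sub-domain names and phrases in the example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Set


def assign_sub_domain(
    patent_phrases: Iterable[str], sub_domains: Dict[str, Set[str]]
) -> Optional[str]:
    """Return the sub-domain sharing the most key phrases with the patent.

    Assumption: overlap count stands in for the manual consistency check.
    Returns None if the patent shares no key phrase with any sub-domain;
    on a tie, the sub-domain listed first is kept.
    """
    phrases = set(patent_phrases)
    best_name, best_overlap = None, 0
    for name, domain_phrases in sub_domains.items():
        overlap = len(phrases & domain_phrases)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name


def cluster_patents(
    patents: Dict[str, List[str]], sub_domains: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Group patent ids by their assigned sub-domain (unassigned ones are skipped)."""
    clusters = defaultdict(list)
    for patent_id, phrases in patents.items():
        name = assign_sub_domain(phrases, sub_domains)
        if name is not None:
            clusters[name].append(patent_id)
    return dict(clusters)


if __name__ == "__main__":
    # Hypothetical sub-domains and test patents, for illustration only.
    sub_domains = {
        "implant fixture": {"thread", "bone", "screw"},
        "abutment": {"abutment", "crown", "screw"},
    }
    patents = {
        "P1": ["thread", "bone"],
        "P2": ["crown", "abutment", "screw"],
        "P3": ["laser"],
    }
    print(cluster_patents(patents, sub_domains))
```

A patent that matches no sub-domain is left unassigned here, so that it can be reviewed manually instead of being forced into a cluster.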
Step 4: Life-span analysis

For each ontological sub-cluster, the age of each patent is calculated from the filing date (not the issue date) up to today. Patents are protected from the filing date; the date on which a patent is granted is called the issue date. It can take up to two or three years before a patent is issued. The average age is the sum of the patent ages divided by the number of patents in the sub-domain cluster, and an example of the information is shown in Table 14. The average age is calculated according to the formula below:

\[ AA = \frac{\sum_{i=1}^{n} Age_i}{n} \]

where
AA = the average age of the sub-domain cluster
Age_i = the age of patent i
n = the total number of patents in a sub-domain
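The average age calculation can be written as a short Python sketch. The function names are hypothetical, and two simplifications are assumed: the reference date is passed in explicitly so that the result is reproducible, and a year is taken as 365.25 days.

```python
from datetime import date
from typing import List


def patent_age_years(filing_date: date, as_of: date) -> float:
    """Age of a patent in years, counted from the filing date (not the issue date)."""
    return (as_of - filing_date).days / 365.25


def average_age(filing_dates: List[date], as_of: date) -> float:
    """Average age of the patents in one sub-domain cluster."""
    if not filing_dates:
        raise ValueError("a sub-domain cluster needs at least one patent")
    ages = [patent_age_years(d, as_of) for d in filing_dates]
    return sum(ages) / len(ages)


if __name__ == "__main__":
    # Invented filing dates for one sub-domain cluster.
    cluster = [date(2002, 1, 1), date(2010, 1, 1)]
    print(round(average_age(cluster, as_of=date(2012, 1, 1)), 1))
```

Passing `as_of` explicitly also makes it possible to compare clusters at a fixed point in time, for example the date on which the test patents were downloaded.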
[Figure 14: flowchart of the life-span analysis. Patent documents collected from the patent database (USPTO) go through the key phrase analysis process in IPDSS, which produces the key phrase matrix (Step 2). The key phrases of each individual patent are compared with the sub-domains of the Type III ontology, and patents are assigned to a sub-domain if their key phrases match its concepts; the patents in each sub-domain form a cluster named after the sub-domain definition (Step 3, ontological sub-domain clustering). The average age of each sub-domain cluster is calculated and each cluster is plotted against its average age, distinguishing the introductory or growth stage from the mature or decline stage (Step 4, life-span analysis).]

Figure 14. Processing of life-span analysis

Table 14. Patent information and average age
Sub-domain 1
| Patent No. | Patent title (PT) | UPC | Filing date | Age |
| P1 | PT1 | UPC1; UPC2; etc. | Month, Year | Age |
| P2 | PT2 | UPC1; UPC2; etc. | Month, Year | Age |
| Average age |  |  |  | AA |

The average age of each cluster is plotted against the ontological sub-domain clusters; Figure 15 illustrates the analysis of potentially emerging or declining clusters depending on their average age. The size of each bubble represents the number of patents, the Y-axis shows the ontological sub-domain clusters, and the X-axis shows the average age from 0 to 20 years (from right to left on the X-axis). Cluster 4 in Figure 15 represents a young cluster of a specific ontological sub-domain, in other words, a specific sub-domain technology in dental implants. This mapping method allows researchers to explore which