with the aid of geoelectrical - imaging techniques and geographical information systems Master’s Thesis in the Master Degree Programme, Civil Engineering KEVIN HINE Department of Civil and Environmental Engineering Division of GeoEngineering Engineering Geology Research Group CHALMERS UNIVERSITY OF TECHNOLOGY Göteborg, Sweden 2005 Master’s Thesis 2005:22 Chromatin pattern analysis of cell nuclei for improved cervical cancer screening Master’s Thesis in Biomedical Engineering 0 5 10 15 20 25 30 35 400 200 400 600 800 1000 1200 1400 1600 1800 Ramin Moshavegh Babak Ehteshami Bejnordi Department of Signals & Systems Chalmers University of Technology Gothenburg, Sweden 2013 Thesis number: EX030/2013 Chromatin pattern analysis of cell nuclei for improved cervical cancer screening Master’s Thesis in Biomedical Engineering Ramin Moshavegh Babak Ehteshami Bejnordi Thesis supervisor and examiner: Assistant Professor Andrew James Heinrich Mehnert Image analysis Chalmers University of Technology Department of Signals and Systems SE-412 96 Gothenburg Sweden Thesis number: EX030/2013 ISSN: 99-2747920-4 Abstract The Papanicolaou (Pap) test or Pap smear is the primary screening test for cervical cancer. It involves the microscopic examination of cells sampled from the cervix. Two major factors affect the accuracy of the Pap test. The first is sampling error wherein no diagnostic cells make it on to the slide. The other is interpretation error for reasons including fatigue, inexperience, and habituation. Whilst computer-assisted interpreta- tion can potentially address the issue of interpretation error it cannot address sampling error. However the malignancy-associated change (MAC) phenomenon may offer a solu- tion. MACs are subtle sub-visual changes in the appearance of otherwise normal-looking cells from an abnormal Pap smear slide. An essential first step in the development of an automated screener, based on MACs, is robust automatic segmentation of free-lying cell nuclei in digitized Pap smear images. This thesis presents and evaluates a fully automated algorithm for robustly detecting and segmenting free-lying cell nuclei in bright-field microscope images of Pap smears. The proposed novel segmentation algorithm makes use of grey-scale annular closings to identify free-lying nuclei-like objects together with marker-controlled watershed seg- mentation to accurately delineate the nuclear boundaries. The method was evaluated empirically using images digitised from Pap smears sourced from the Regional Cancer Centre in Thiruvananthapuram in India. The results show that the sensitivity and speci- ficity of nucleus detection is 94.71% and 85.30% respectively, and that the accuracy of segmentation, measured using the Dice coefficient, of the detected nuclei is 97.30±1.3%. This thesis also presents and evaluates a set of novel structural texture features for quantifying and classifying nuclear chromatin patterns in cells on a conventional Pap smear. The experimental results demonstrate the efficacy of the proposed structural approach and that a combination of the structural texture features and conventional features can be used to discriminate between normal and abnormal slides with high accuracy (0.954±0.019 AUC±SE ). They also demonstrate that it is possible to detect MACs in Papanicoloau stain (which is not stoichiometric). This in turn suggests the possibility of developing a fully automated Pap smear screener based on MACs. , Signals and Systems, Master of Science Thesis 2013 i Acknowledgements First and foremost we would like to express our deepest appreciation to our supervi- sor Assistant Professor Andrew Mehnert, whose guidance, stimulating suggestions and encouragement, helped us to coordinate the thesis and write this report. We also extend a big thank you to our co-supervisors Professor Ewert Bengtsson and Patrik Malm, CBA Uppsala, for their guidance and support and also for providing the data needed. We express our gratitude to MedTech West and its board for providing an inspiring research environment. Furthermore we would also like to acknowledge, with much appreciation, our fel- low researchers at MedTech West, Yazdan Shirvany, Qaiser Mahmmod and Mohammad Alipoor for their friendship and support. Finally, we thank our families, and especially our parents, for their unflagging love and support throughout our life; this thesis would have been impossible without them. The Authors, Gothenburg, Sweden 2013/03/20 , Signals and Systems, Master of Science Thesis 2013 iii Contents 1 Introduction 1 1.1 The Pap smear test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.1.1 Shortcomings of the Pap smear test . . . . . . . . . . . . . . . . . 3 1.1.2 Automated screening . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Malignancy associated changes (MACs) . . . . . . . . . . . . . . . . . . . 4 1.3 Aim and objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.4 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.5 Structure of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2 Literature review 9 2.1 Review of existing methods for segmenting cells and nuclei in Pap smear images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Review of existing features devised to quantitatively characterise chro- matin distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Review of chromatin texture features . . . . . . . . . . . . . . . . . 12 3 Novel algorithm for segmenting free-lying cell nuclei 15 3.1 High-level description of the proposed algorithm . . . . . . . . . . . . . . 15 3.2 Extraction of inner markers (step 1) . . . . . . . . . . . . . . . . . . . . . 16 3.3 Marker-controlled watershed segmentation of the detected nucleus-like ob- jects (steps 2-3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.4 Artefact Rejection (steps 4-5) . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4.1 Size criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.4.2 Shape criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.4.3 Texture criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 3.5 Empirical evaluation of the proposed segmentation algorithm . . . . . . . 28 3.5.1 Image data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.5.2 Method and Experiment . . . . . . . . . . . . . . . . . . . . . . . . 31 3.5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 v CONTENTS 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4 Texture features characterising chromatin distribution 35 4.1 Chromatin segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.2 Proposed structural texture features . . . . . . . . . . . . . . . . . . . . . 37 4.2.1 Graph-based features . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.2.2 Margination features . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.2.3 Clustering features . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.2.4 Contextual chromatin particle features . . . . . . . . . . . . . . . . 41 4.2.5 Statistics of chromatin particle morphometric features . . . . . . . 41 4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5 Evaluation of the proposed texture features for screening Pap smear slides 45 5.1 Pap smear images and ground truth . . . . . . . . . . . . . . . . . . . . . 45 5.2 Feature selection methods considered in this study . . . . . . . . . . . . . 46 5.2.1 The curse of dimensionality and the need for feature selection . . . 46 5.2.2 MSVM-RFE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.2.3 Guided Regularized Random Forest (GRRF) . . . . . . . . . . . . 47 5.2.4 L1-regularization procedure with generalised linear model . . . . . 48 5.3 Experiment I: Determination of the most discriminatory subset of features for the classification of Pap smears . . . . . . . . . . . . . . . . . . . . . . 48 5.3.1 Cytology slides for experiment I . . . . . . . . . . . . . . . . . . . 49 5.3.2 Feature extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 5.3.3 Feature selection and classification . . . . . . . . . . . . . . . . . . 50 5.3.4 Results of experiment I . . . . . . . . . . . . . . . . . . . . . . . . 51 5.4 Experiment II: MAC-based classifiers for Pap smear screening . . . . . . . 52 5.4.1 Cytology slides for experiment II . . . . . . . . . . . . . . . . . . . 53 5.4.2 Classification performance evaluation of the best subset of features derived in experiment I . . . . . . . . . . . . . . . . . . . . . . . . 53 5.4.3 Results of experiment II . . . . . . . . . . . . . . . . . . . . . . . 54 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6 Summary and Conclusion 57 6.1 Thesis summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 6.2 Key contributions and findings . . . . . . . . . . . . . . . . . . . . . . . . 58 6.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 6.4 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 6.5 Opportunities for further research . . . . . . . . . . . . . . . . . . . . . . . 61 Appendices 63 vi , Signals and Systems, Master of Science Thesis 2013 CONTENTS A Fundamental operations from Mathematical Morphology 64 A.1 Mathematical morphology for binary images . . . . . . . . . . . . . . . . . 64 A.2 Mathematical morphology for grey-scale images . . . . . . . . . . . . . . . 65 B Segmentation methods underlying methods in Chapters 2 and 3 66 B.1 Thresholding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 B.2 Active contours and Deformable models . . . . . . . . . . . . . . . . . . . 67 B.3 Seeded Region Growing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 B.4 Watershed Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 C Cytology features 75 D Candidate cell nuclei marker extraction methods 76 D.1 Morphological reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . 76 D.1.1 Grey-scale reconstruction: . . . . . . . . . . . . . . . . . . . . . . . 76 D.1.2 Regional minima: . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 D.1.3 H-extrema: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 D.1.4 Morphological Top-hat Transform . . . . . . . . . . . . . . . . . . 79 D.1.5 Improved Morphological Top-hat Transform . . . . . . . . . . . . . 80 E Shape Criteria 82 E.1 Danielsson’s G shape factor . . . . . . . . . . . . . . . . . . . . . . . . . . 82 E.2 Dice similarity coefficient . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 E.3 Eccentricity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 F Classifiers used in the experiment II 85 F.1 SVM classifier with a linear kernel . . . . . . . . . . . . . . . . . . . . . . 85 F.1.1 Theory of linearly separable binary classification . . . . . . . . . . 85 F.2 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 G Publications 91 Bibliography 109 , Signals and Systems, Master of Science Thesis 2013 vii List of Figures 1.1 Data acquisition and nucleus segmentation in an automated screening system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.1 Extracting an inner marker for a free-lying cell nucleus. . . . . . . . . . . 17 3.2 Segmentation of the detected nuclei-like objects. . . . . . . . . . . . . . . 19 3.3 Cells of the squamous epithelium . . . . . . . . . . . . . . . . . . . . . . . 21 3.4 Ellipse fitting to an arbitrary shape . . . . . . . . . . . . . . . . . . . . . . 22 3.5 Computing averages at different scales . . . . . . . . . . . . . . . . . . . . 23 3.6 Measure of Tamura coarseness for 6 different cell nuclei. . . . . . . . . . . 24 3.7 Artefacts rejected by the Tamura coarseness feature . . . . . . . . . . . . 24 3.8 Two sample FOVs from a Pap smear slide . . . . . . . . . . . . . . . . . . 25 3.9 GUI for Ground truth generation . . . . . . . . . . . . . . . . . . . . . . . 29 3.10 GUI for Ground truth generation . . . . . . . . . . . . . . . . . . . . . . . 30 3.11 DSC scores for the evaluation of the proposed automatic segmentation . . 32 3.12 Examples of false positive detection by the algorithm . . . . . . . . . . . . 33 3.13 missed nuclei due to presence of background noise and watershed failure . 34 3.14 Missed nuclei due to unsuitable choice of parameter values . . . . . . . . . 34 4.1 Representation of a nucleus as a topographic relief (grey-level represents height) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.2 Chromatin segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.3 Illustration of how the area-Voronoi diagram and generalised Delaunay graph are computed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.4 Graphs related to the Delaunay graph . . . . . . . . . . . . . . . . . . . . 38 4.5 Illustration of how the features M1 and M2 are computed. . . . . . . . . 39 4.6 Illustration of how the features C1 and C2 are computed. . . . . . . . . . 40 4.7 Illustration of how the features C3 and C4 are computed. . . . . . . . . . 41 5.1 Pie chart showing the contribution of the four main object feature cate- gories in this study. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 viii LIST OF FIGURES 5.2 Illustration of 100 times iterated 5-fold cross validation feature selection . 51 5.3 Boxplots of the features F1, F2, F3, F4, F5, F6 for experiment 1. . . . . . 52 5.4 An example of a digitalized Pap smear FOV . . . . . . . . . . . . . . . . . 54 5.5 Boxplots of the features F1, F2, F5, F6 for experiment II. . . . . . . . . . 55 6.1 Focus stacking for extended depth of field in optical microscopy. . . . . . 60 B.1 Global Thresholding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 B.2 Parametric representation of a contour . . . . . . . . . . . . . . . . . . . . 69 B.3 Active contour segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 70 B.4 Cell nuclei segmentation by GVF snake. . . . . . . . . . . . . . . . . . . . 71 B.5 Simulating the flooding of a topographic surface . . . . . . . . . . . . . . 73 B.6 Computing the magnitude of gradient image: Watershed transform con- siders the gradient image as a topographic surface. . . . . . . . . . . . . . 74 D.1 Illustration of regional minimum at level 2 in an image . . . . . . . . . . . 78 D.2 H-maxima transformation on a 1-D signal and extracting the regional maxima by computing f −Rδf (f − h). . . . . . . . . . . . . . . . . . . . . 78 D.3 H-minima transformation on a 1-D signal and extracting the regional min- ima by computing Rεf (f + h)− f . . . . . . . . . . . . . . . . . . . . . . . . 79 D.4 Morphological Top-hat Transform on an image . . . . . . . . . . . . . . . 80 E.1 G Shape-factor calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 E.2 Parametric description of an ellipse . . . . . . . . . . . . . . . . . . . . . . 83 F.1 SVM for two-class data classification by a hyperplane . . . . . . . . . . . 86 F.2 Plot of a logistic function . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 F.3 Two-class data classification by a linear logistic regression decision boundary 89 , Signals and Systems, Master of Science Thesis 2013 ix List of Tables 2.1 Comparison between cell and cell nucleus segmentation approaches found in the literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 3.1 Proposed algorithm to detect and segment free-lying intermediate cell nuclei 16 3.2 Shape and granularity features for 6 different cell nuclei. . . . . . . . . . . 26 3.3 Shape and granularity features for 6 non-nuclei artefacts . . . . . . . . . . 27 4.1 Summary of the extracted nucleus features . . . . . . . . . . . . . . . . . 42 5.1 Frequency table for top 5 ranking features in experiment I. . . . . . . . . 51 5.2 Classification results for experiment I. . . . . . . . . . . . . . . . . . . . . 52 5.3 Classification results for experiment II. . . . . . . . . . . . . . . . . . . . . 55 x 1 Introduction C ervical cancer is the third most diagnosed cancer and the fourth leading cause of cancer death among women worldwide accounting for 9% of all ma- lignancies among females in 2008 [1]. More than 85% of the cervical cancer cases occur in developing countries where public health infrastructure does not support Papanicolaou testing [1]. In countries with developed healthcare systems, widespread cervical screening programmes, aimed at detecting precancerous changes that can then be treated to prevent invasive cancer, have significantly reduced the number of deaths from the disease. The Papanicolaou (Pap) smear, is the primary screening test for cervical cancer. It has been largely responsible for diagnosing cancerous and precan- cerous lesions in many developed countries [2]. There is compelling evidence showing the efficacy of organised screening programmes based on performing the Pap smear test every 3-5 years. In some of the Nordic countries the incidence of invasive cervical cancer has declined by 80% since the introduction of organised screening programmes [3]. In Sweden the incidence of cervical cancer has decreased by 65% during the last 40 years but the incidence of invasive cervical cancer and cervical cancer mortality figures have been rather stable over the last decade [4]. Fortunately, invasive cervical cancer takes years to develop from slowly progressing precancerous lesions, thus enabling early detection through screening and diagnostic tests. Currently screening by conventional cervical cytology (Pap smear test) remains the principal global strategy to prevent invasive cervical cancer. However, the Pap smear test has several shortcomings including: subjective nature (dependent on individual interpretation), low sensitivity (i.e. ability to detect abnormal changes) and the need for frequent retesting [5]. The remainder of the present chapter is organised as follows: Section 1.1 describes the Pap smear test and discusses its shortcomings, and then presents the rationale for automated screening of Pap smear slides and describes the problems inherent in current automated screening systems. Section 1.2 then discusses a phenomenon known as ma- 1 CHAPTER 1. INTRODUCTION lignancy associated changes (MACs), which can offer a solution to the automated Pap smear screening problem. The aim and objectives of the thesis are presented in section 1.3 and section 1.4 outlines the scope of the research. Finally, section 1.5 outlines the structure of this thesis. 1.1 The Pap smear test The Papanicolaou (Pap smear) test is the most cost-effective cancer prevention and detection program ever invented [6]. It was devised by George N. Papanicolaou. In 1928 he first presented his findings that malignant cells from the cervix can be detected in a small sample of cells collected from the cervix (vaginal smears) [7]. Over a decade passed before collaboration between Dr. Papanicolaou and Dr. Her- bert Traut, a gynaecologist and pathologist, provided scientific evidence of the potential of vaginal smears for the identification of cervical cancer and precancerous changes. Traut provided Papanicolaou with a large number of clinical samples from female patients at Cornell’s Hospital. Papanicolaou published a detailed description of pre-invasive cervical lesions in his treatise, “Diagnosis of uterine cancer by the vaginal smear” [8] and in a major paper [9]. The conventional Pap test is a simple procedure comprising the following steps [6]: • A speculum is inserted into the vagina to widen the opening so that the cervix can be viewed; • Cells are sampled from inside and around the cervix using a swab, brush, or spatula; • Cells are pressed on a glass slide and a fixative (preservative) is applied to preserve the sample; • The samples are stained to ameliorate the contrast in the specimen and highlight the structural patterns to be analysed with a light microscope in a cytology labo- ratory [10, 11]. Liquid-based cytology (LBC) is a new method for cellular sample preparation for cytological tests. The main difference between LBC and the conventional Pap smear is related to the underlying preparation technique [12]. The sample is collected in a similar way to the Pap smear, but rather than smearing the cells onto a glass slide, the cellular material is immediately rinsed into a preservative liquid solution. The sample is then sent to the laboratory where special filtering techniques are used to remove non- diagnostic materials such as mucus, pus and blood cells. In the next step, a thin layer of cells is deposited onto a slide. Finally, the slide is examined under the microscope by a cytologist in the same way as in the conventional smear test [12]. 2 , Signals and Systems, Master of Science Thesis 2013 1.1. THE PAP SMEAR TEST 1.1.1 Shortcomings of the Pap smear test It usually takes several days or weeks to prepare the final results of a Pap smear test. Doctors and nurses sample the cervix and send the specimen to a pathology laboratory for visual evaluation under a microscope [13]. The microscopic examination itself is laborious and time-consuming involving the review of possibly hundreds of thousands of cells for signs of cancer or precancer. It is not surprising therefore that 1 in every 10 to 20 positive cases is missed in routine screening [5]. Two major factors affect the accuracy of the Pap smear test. The first is sampling error wherein no diagnostic cells make it onto the slide. This occurs when health care providers fail to adequately sample the cervix (failing to sample precancerous/cancerous cells when they are present). It also occurs when the precancerous/cancerous cells on the collecting device do not make it onto the glass slide. The second factor affecting the ac- curacy of the Pap smear test is interpretation error by the laboratory specialist (for any of a number of reasons including fatigue, inexperience, and habituation). These short- comings have motivated the research and development of automated screening systems [5, 13]. 1.1.2 Automated screening Automated screening machines can analyse Pap smear slides in a short time without fatigue, providing consistent and objective classification results. The rationale for auto- mated screening is to improve the limitations of the conventional Pap smear test in the following ways [14]: • Increase the sensitivity1 and specificity2 of the Pap smear test; • Decrease the workload of technicians and pathologists; • Reduce the cost for cervical cancer screening programmes; and • Lower the probability of incidence of cervical cancer and the mortality rate from the disease. At present there are two FDA-approved cervical cancer screening systems: BD Focal- Point GS Imaging System (formerly known as TriPath AutoPap system) and HOLOGIC ThinPrep Imaging System (formerly known as Cytyc Thin Prep Imaging System) [15]. The FocalPoint GS Imaging System works on LBC slides and any Pap stained con- ventional smear. The machine scores and ranks the slides based on the likelihood of abnormality and categorizes them into four groups: review, no further review (NFR), process review and quality control (QC) review [15]. Among the analysed slides the NFR group, which comprises up to 25% of the total qualified slides, can be archived without human review. The ThinPrep AutoPap system, by contrast, can only work on ThinPrep 1The sensitivity measures the proportion of actual positives that are correctly classified. 2The specificity measures the proportion of negatives which are correctly classified. , Signals and Systems, Master of Science Thesis 2013 3 CHAPTER 1. INTRODUCTION LBC slides with a special stain which is nearly stoichiometric1 for DNA content. This system automatically selects 22 field of view (FOV) images that are of diagnostic interest for cytopathologists, from among a total of 120 FOVs acquired on each slide. These 22 FOVs must then be fully reviewed by a cytopathologist. The system provides no scoring or ranking of the slides [15]. Whilst automated screening systems can reduce false negatives attributed to inter- pretation errors, they cannot reduce false negatives due to sampling errors. Research [16] suggests that a phenomenon known as malignancy-associated changes (MACs) may offer a solution. 1.2 Malignancy associated changes (MACs) The expression malignancy associated changes (MACs) was coined by Nieburgs et al. (1959) [17]. They reported subtle changes in nuclei of apparently normal looking cells “adjacent to or distant from malignant tumours” [18]. Research in the 1980s identified sub-visual alterations in intermediate cells from cervical atypical smears [19]. The development of an automated screener based on detecting MACs can potentially overcome the problem of sampling error because it is not necessary to perform an ex- haustive review of all of the cellular material to identify diagnostic cells but rather to look for subtle nuclear texture changes in a sub-population of cells sampled from the slide. Based on the available literature, the most effective MAC features seem to be tex- tural features [20]. To date, most of the approaches for defining nucleus texture have been based on stochastic approaches. However, the work of Mehnert [20] suggests that an alternative structural approach to defining texture, which corresponds well to what cytopatologists perceive from cell nuclei, is more effective to describe the chromatin2 distribution inside the nuclei. 1.3 Aim and objectives The aim of this research was to fully explore the structural approach to chromatin pattern description and to evaluate the efficacy of the features derived from it for discriminating between normal and abnormal Pap slides. The research had the following objectives: 1. To develop a robust algorithm for detecting and segmenting cell nuclei in digitized Pap smear images obtained using bright-field microscopy; 2. To develop structural texture features that quantitatively characterise the pattern (arrangement, size, shape, etc.) of the nuclear chromatin; 1Stains for which the amount of stain uptake in the nucleus is proportional to the amount of DNA are Stoichiometric stains. 2Chromatin is the combination of DNA and proteins that make up the contents of the nucleus of a cell. 4 , Signals and Systems, Master of Science Thesis 2013 1.4. SCOPE 3. To determine the most discriminatory subset of features for discriminating between normal and abnormal slides using real clinical data; and 4. To evaluate the performance of a classifier(s), based on the selected features, using real clinical data. 1.4 Scope As noted in section 1.3, MACs are subtle sub-visual alterations in the appearance of normal looking cells from an abnormal Pap smear. The features which appear to have the most discriminatory power are nuclear texture features [21]. These features reflect chromatin structure and characterise the distribution of chromatin inside the nucleus. The MAC approach to analysing Pap smear slides is conceptually straight forward (see Figure 1.1). The process involves automatically capturing digital images of individual FOVs from a Pap smear slide, identifying the location of nucleus like objects (scene segmentation), segmenting the nuclei like objects in the image (nucleus segmentation), extracting quantitative texture and other features for each nucleus and finally classifying the slides as either normal or abnormal based on these features. The proposed MAC-based cervical screening approach has the following steps: 1. Scanning the Pap-stained slide using a light microscope coupled with a CCD cam- era with multiple objectives. 2. Capturing multiple images at different focal planes from interesting fields of view (FOVs) on the slide. 3. Generating extended depth-of-field (EDF) images for each FOV. This involves combining multiple focal planes for each FOV to obtain a single image where each object is all in focus. 4. Locating and segmenting the free-lying cell nuclei in each EDF image, and per- forming artefact rejection to make sure only nuclei-like objects are retained; 5. Segmenting the nuclear chromatin inside each nucleus into texture primitives (blobs); 6. Extracting features from this structural model to quantitatively characterise the chromatin pattern; 7. Deriving slide-based features from the features in 6 in order to classify the slide as normal or abnormal. This thesis does not consider the slide scanning and FOV acquisition steps (steps 1 and 2). , Signals and Systems, Master of Science Thesis 2013 5 CHAPTER 1. INTRODUCTION Figure 1.1: Data acquisition and nucleus segmentation in an automated screening system. (a) The cytometer1 scans the Pap-stained slide and captures scenes from the deposition area on the slide in a predefined way. (b) The location of the cervical epithelial nuclei are identified inside each microscope field of view. (c) Each nucleus-like object is then segmented by defining its boundaries. 1The cytometer utilises a CCD camera mounted on a light microscope (fitted with a 40× objective lens) to acquire 8-bit images of Papanicolaou-stained cells. 6 , Signals and Systems, Master of Science Thesis 2013 1.5. STRUCTURE OF THE THESIS 1.5 Structure of the thesis This chapter has: • Provided an overview of the Pap smear test, including its shortcomings, and dis- cussed the rationale for automated screening of Pap smear slides. • Described the malignancy associated changes (MACs) phenomenon and its rele- vance to automated screening, in particular in addressing the problem of sampling error. • Defined the aim, objectives, and scope of this research. The remainder of the thesis is organised as follows: Chapter 2 This chapter presents two literature reviews pertinent to chapter 3 and chapter 4. The first is a review of the cell and cell nucleus segmentation meth- ods in Pap smear images published in the literature. The second is a review of the texture features available in the literature devised to quantify chromatin tex- ture/distribution. Chapter 3 This chapter deals specifically with the problem of accurately and robustly segmenting the cervical cell nuclei in digitized light microscopy images of Pap smears. A novel algorithm is developed to detect and segment free-lying interme- diate cell nuclei. The algorithm includes three main steps of locating the free-lying nuclei, delineating nuclei and rejecting the artefacts. This chapter also presents an empirical evaluation of the proposed segmentation algorithm for both detection and delineation of free-lying nuclei in Pap smear images. Chapter 4 This chapter deals with the problem of quantitative characterisation of chro- matin texture and presents a set of novel structural texture features to describe nuclear chromatin patterns in cells on a conventional Pap smear. These features are derived from a segmentation of the chromatin into blob-like primitives. The proposed set of features are, in particular, derived from statistics of morphometric features and contextual features computed for these blobs. Chapter 5 This chapter presents an evaluation of the performance of the proposed structural chromatin texture features. In particular, it presents an investigation of the most discriminatory subset of features, from among the proposed features and a wide range of features drawn from the literature, for discriminating between normal and abnormal Pap smears using the MAC approach. The chapter presents the details of the two experiments carried out in this study. The first is a fea- ture selection experiment performed to obtain the most discriminatory subset of features. The second experiment is to evaluate the performance of a variety of clas- sifiers built using the feature subset obtained in the first experiment to discriminate between the normal and abnormal slides. , Signals and Systems, Master of Science Thesis 2013 7 CHAPTER 1. INTRODUCTION Chapter 6 This chapter reviews the work that is presented in this thesis and sum- marises the major contributions and findings. In addition, it outlines the limita- tions of the research undertaken, and proposes an avenue of future research. The material presented in chapters 3, 4 and 5 have been published in the Proceedings of the 2012 Annual International Conference of the IEEE Engineering in Medicine and Bi- ology Society (EMBC 2012) [22] and the Proceedings of the 2013 SPIE Medical Imaging Conference [23]. 8 , Signals and Systems, Master of Science Thesis 2013 2 Literature review T his chapter presents two literature reviews pertinent to chapter 3 and chapter 4. The first is presented in Section 2.1. It is a review of the cell and cell nucleus segmentation methods in Pap smear images published in the literature. The conclusions motivate the method proposed in chapter 3. The second is presented in section 2.2. It is a review of the standard texture features available in the literature devised to quantify chromatin texture/distribution. 2.1 Review of existing methods for segmenting cells and nuclei in Pap smear images This section provides a review of the existing literature on the topic of automated de- tection and segmentation of cells and cell nuclei in Pap smear images. The following scientific databases were searched: IEEEXplore1, Inspec2, Medline3, ScienceDirect4. The main keywords chosen were: Segmentation, cell segmentation, nuclei segmentation, Pap smears. Image segmentation is the process of partitioning an image into sub-images corresponding to the objects of interest and background. In general, automated segmen- tation is one of the most difficult tasks in image processing. Numerous algorithms have been published in the literature for segmenting cells and cell nuclei in microscopy images. They can be categorized according to the primary underlying segmentation methodol- ogy used: global and adaptive thresholding [24], watershed transform [24, 25], boundary detection algorithms and deformable models [26, 27, 28], and edge enhancement based techniques [29, 30]. These methods are summarized in table 2.1. 1http://ieeexplore.ieee.org 2http://www.engineeringvillage.com 3http://www.ncbi.nlm.nih.gov/pubmed 4http://www.sciencedirect.com 9 CHAPTER 2. LITERATURE REVIEW The basic methods underlying the existing cell and cell nucleus segmentation al- gorithms mentioned in table 2.1 are presented in Appendix B. These methods employ fundamental concepts and elements of mathematical morphology (see Appendix A). An objective comparison between the performances of these methods is difficult, either because no quantitative analysis of performance is provided or because widely different evaluation methodologies (including data and methods) have been used. The lack of information about how some parameters are derived further complicates the assessment of these methods. Table 2.1 shows a detailed comparison of the cell and cell nuclei segmentation methods in Pap smear images available in the literature. The watershed approach to segmentation has proved to be a powerful and fast seg- mentation technique for both object boundary delineation and region-based segmenta- tion. Simplicity, speed and complete division of the image are the properties that make the watershed transform a popular method for many different image segmentation appli- cations. The method has been frequently applied to biological images and has produced good results. Watershed-based methods have also been applied for segmentation of clus- tered cell nuclei [25, 31]. However, these methods usually lead to over-segmentation. In order to alleviate this issue, heuristic rules are defined (as a post-processing step) to merge the over-segmented regions to produce the final segmented image. The alternative approach is to use marker-controlled watershed, which effectively handles the problem of over segmentation [32]. The method requires that the extracted markers represent true cell nuclei. In the following chapter we present a novel algorithm for extracting markers of candidate free-lying nuclei-like objects for subsequent marker-controlled watershed segmentation to obtain the nucleus boundaries. 10 , Signals and Systems, Master of Science Thesis 2013 2.1. REVIEW OF EXISTING METHODS FOR SEGMENTING CELLS AND NUCLEI IN PAP SMEAR IMAGES T a b le 2 .1 : C om p ar is on b et w ee n ce ll a n d ce ll n u cl eu s se g m en ta ti o n a p p ro a ch es fo u n d in th e li te ra tu re A u th o r S e g m e n ta ti o n M e th o d Im a g e s Im a g e si z e N u m b e r o f c e ll s Q u a n ti ta ti v e m e a su re o f p e rf o rm a n c e B a m fo rd et a l. [2 6 ] A ct iv e co n to u rs fo r n u cl eu s se g m en ta ti o n 2 0 1 3 0 1 2 8 × 1 2 8 2 0 1 3 0 (S in g le ce ll im a g es ) V is u a l ev a lu a ti o n : 9 9 .6 % o f th e P a p st a in ed ce ll im a g es w er e co rr ec tl y se g m en te d . W u et a l. [2 8 ] T em p la te fi tt in g fo r n u cl eu s se g m en ta ti o n 1 8 0 × 1 0 0 1 M is cl a ss ifi ca ti o n ra te o f th e p a ra m et ri c fi tt in g a p - p ro a ch fo r th e sy n th et ic ce ll im a g e≤ 5 % G a rr id o et a l. [2 7 ] T em p la te fi tt in g fo r n u cl eu s se g m en ta ti o n 3 U n k n o w n U n k n o w n N o Q u a n ti ta ti v e m ea su re s L ez o ra y et a l. [2 5 ] W a te rs h ed tr a n sf o rm fo r n u cl eu s se g m en ta ti o n 1 0 U n k n o w n 2 0 9 V in et m ea su re * : 2 .2 4 a n d 3 .4 1 in R G B a n d H S L (H u e, S a tu ra ti o n , L u m in a n ce ) co lo r sp a ce . M ea n d iff er en ce fr o m th e g ro u n d tr u th : 2 .8 % (R G B ) - 0 .4 7 % (H S L ). L a ss o u a o u i et a l. [3 3 ] G en et ic a lg o ri th m fo r ce ll se g m en ta ti o n 2 2 5 6 × 2 5 6 U n k n o w n N o Q u a n ti ta ti v e m ea su re s Y a n g -M a o et a l. [2 9 ] E d g e d et ec to r fo r n u cl eu s a n d cy to p la sm se g m en ta ti o n U n k n o w n 6 4 × 6 4 1 2 4 (s in g le ce ll im a g es ) T h e a v er a g e se g m en ta ti o n er ro rs (M H D , E M M , R A E , M E )* * o f 0 .1 5 2 3 a n d 0 .0 7 7 5 fo r n u cl eu s a n d cy to p la sm se g m en ta ti o n , re sp ec ti v el y. L in et a l. [3 0 ] E d g e d et ec to r fo r n u cl eu s a n d cy to p la sm se g m en ta ti o n 1 0 U n k n o w n 1 0 (s in g le ce ll im a g es ) A v er a g e se g m en ta ti o n er ro r (t h e a v er a g e o f M E , R A E , a n d M H D ) o f 0 .1 3 2 3 fo r n u cl eu s se g m en ta - ti o n . P li ss it ie t a l. [3 4 ] E d g e d et ec to r fo r n u cl eu s se g m en ta ti o n 3 8 1 5 3 6 × 2 0 4 8 5 6 1 7 S en si ti v it y : 9 0 .5 7 % fu zz y C -m ea n s (F C M ), 6 9 .8 6 % su p p o rt v ec to r m a ch in e (S V M ) S p ec ifi ci ty : 7 5 .2 8 % (F C M ), 9 2 .0 2 % (S V M ) * V in e t m e a su re [3 5 ] is u se d to q u a n ti fy th e d is ta n c e b e tw e e n tw o se g m e n ta ti o n s a n d c o rr e sp o n d s to th e c o rr e c t c la ss ifi c a ti o n ra te . * * M is c la ss ifi c a ti o n e rr o r (M E ), e d g e m is m a tc h (E M M ), re la ti v e fo re g ro u n d a re a e rr o r (R A E ), a n d m o d ifi e d H a u sd o rff d is ta n c e (M H D ), a re o ft e n u se d a s se g m e n ta ti o n p e rf o rm a n c e m e a su re s v a ry in g fr o m 0 fo r a n a b so lu te ly c o rr e c t se g m e n ta ti o n to 1 fo r a to ta ll y e rr o n e o u s c a se [3 6 ]. , Signals and Systems, Master of Science Thesis 2013 11 CHAPTER 2. LITERATURE REVIEW 2.2 Review of existing features devised to quantitatively characterise chromatin distribution This section presents a review of methods proposed for quantitative characterisation of the distribution of chromatin; i.e chromatin texture. 2.2.1 Review of chromatin texture features A taxonomy of features for cytometry (cell measurement) on microscope images can be found in Appendix C. One class of features in this taxonomy are texture features. In principle such features can be applied to any image object including a cell nucleus. Two main approaches exist for describing the chromatin arrangement/texture in the cell nucleus. The first approach, called the statistical approach, assumes that texture is a realization of a stochastic process governed by a set of parameters [37] and charac- terises the chromatin distribution by second or higher order statistics. In the other, the structural approach, the chromatin distribution is assumed to be composed of primitives that are arranged according to certain placement rules. Traditionally MAC features have been based on a statistical approach to defining texture. However such features do not correspond well to the terms used by cytopathol- ogists to describe chromatin texture such as heterogeneity, granularity, margination, condensation, compaction, clumping, diffuse, blobs and particles [20]. Another difficulty is that these features are sensitive to changes in, or non-uniformity of, illumination and staining. This motivates interest in a structural approach to chromatin texture analysis. Several methods based on structural texture analysis have been proposed to detect structural alterations of the nuclear chromatin. Beil et al. [38] proposed a dual ap- proach to structural texture analysis for microscopic cell images by region or by lines. In particular they describe the texture in terms of: • the properties and the arrangement of regions; and • the properties and the arrangement of lines. Beil et al. [39] proposed a set of region texture features that corresponds to human vision, such as: number of regions, number of large regions (size>threshold), number of small regions (size<=threshold), number of regions at the boundary of the analysed area, average size of region, etc. They also proposed features for line textures including directionality and fractal dimension of line structures. Albregtsen et al. [40] presented a structural texture analysis approach that “for each pixel in the image uses concepts from adaptive filtering to find the neighbouring pixels that belong to the local texel”. Thereafter, moment based features are extracted to characterise the grey level texel as an object. Walker and Jackway [41] used features based on the statistics of geometrical (SGF) attributes of connected regions developed by Chen et al. (statistical geometrical features for texture analysis) to quantitatively characterise the chromatin in the nuclei of Papanicolaou-stained cervical cells [41]. Mehnert [20] proffered a method, called the 12 , Signals and Systems, Master of Science Thesis 2013 2.2. REVIEW OF EXISTING FEATURES DEVISED TO QUANTITATIVELY CHARACTERISE CHROMATIN DISTRIBUTION adjacency graph attribute co-occurrence matrix (AGACM) that combines both struc- tural and statistical/stochastic aspects of texture for characterising both blob-like and mosaic patterns (texture) in the plane. The Cyto-Savant imaging system computes 116 nucleus features, which can be grouped into 3 general categories: nuclear morphological features, chromatin texture, and DNA content [42]. The measured texture features can be divided into statistical and struc- tural groups. Chromatin distribution is statistically described by Markovian and non- Markovian texture features, fractal texture and run length features. Discrete texture features, however, reflect the structural aspect of chromatin distribution and are com- puted by first segmenting the nucleus into regions of low, medium, and high optical density. These regions are defined by two global thresholds. Discrete texture features characterise the segmented regions by computing their size, shape, optical density, and spatial distribution. Details of algorithms are described elsewhere [42]. , Signals and Systems, Master of Science Thesis 2013 13 3 Novel algorithm for segmenting free-lying cell nuclei T he present chapter deals specifically with the problem of accurately and robustly segmenting the cervical cell nuclei in digitized light microscopy images of Pap smears. Given that the aim of the segmentation is to support MAC analysis, the goal is not to segment every nucleus but rather to segment free-lying nuclei. These are easier to segment and the nuclear texture is less likely to be affected by overlapping. Based on the literature review in Section 2.1 we opted to develop our own novel algorithm based on the marker controlled watershed transform for detecting and segmenting free- lying intermediate cell nuclei. The remainder of this chapter is organised as follows. Section 3.1 presents the high- level description of the proposed algorithm. Section 3.2 presents the proposed marker extraction algorithm corresponding to the first step of the segmentation algorithm. Sec- tion 3.3 describes the implementation of the marker-controlled watershed algorithm for segmenting and delineating the detected nuclei. Section 3.4 presents the artefact rejec- tion strategy developed to discard segmented objects that are not nuclei. Section 3.5 describes the choice of features for the artefact rejection step. An empirical evaluation of the performance of the proposed segmentation algorithm is presented in section 3.6. Finally, section 3.7 summarises the chapter. 3.1 High-level description of the proposed algorithm The proposed segmentation approach is conceptually a 3 step process (see Table 3.1): 1. Detecting or locating the objects of interest; 2. Delineating and labelling those objects; and 15 CHAPTER 3. NOVEL ALGORITHM FOR SEGMENTING FREE-LYING CELL NUCLEI 3. Artefact rejection to ensure only desired objects are retained in the segmentation results. Table 3.1: The proposed algorithm to detect and segment free-lying intermediate cell nuclei • Input: Grey-scale image containing a field-of-view (FOV) from a Pap smear slide. • Output: Binary image containing connected components, each corresponding to a free-lying intermediate cell nucleus. Steps: 1. Extract inner markers for free-lying nuclei-like objects (these locate the inte- riors of candidate objects). 2. Apply the marker-controlled watershed transform on the FOV image with respect to the inner markers (this yields an outer marker that lies between the candidate objects). 3. Apply the marker-controlled watershed transform on the gradient image with respect to the inner and outer markers (this yields the object bound- aries/masks). 4. Compute the area and quantitative measures of shape and texture granularity for each segmented nuclei-like object. 5. Reject objects that are too small or large to be intermediate cells, that do not have an elliptical shape, and that do not have a granular texture. 3.2 Extraction of inner markers (step 1) In Pap-stained cervical images, the cell nucleus appears darker than the rest of the cellu- lar material. Approaches to finding a marker (connected component) within each nucleus exploit this fact. Several candidate methods for extracting marker for nuclei were con- sidered including: tophat transform [43], Jackway tophat [44], h-minima [45]. Appendix D presents the basic methods underlying the existing nuclear marker extraction algo- rithms. However, we found that none of these methods robustly detects free-lying nuclei. Therefore, we developed a new nuclear marker extraction algorithm for this purpose. The algorithm is based on the tophat transform defined in terms of an annular closing. The grey-scale annular closing operator is defined Ψanclo(f,B) = (f B) ∨ f (3.1) 16 , Signals and Systems, Master of Science Thesis 2013 3.3. MARKER-CONTROLLED WATERSHED SEGMENTATION OF THE DETECTED NUCLEUS-LIKE OBJECTS (STEPS 2-3) where f is a grey-scale image and B is a symmetric structuring element that does not contain its origin. When annular closing is applied to a grey-scale image, isolated dark spots will be removed. The tophat by annular closing, given by Ψanclo(f,B)− f , yields the removed isolated dark spots. These spots serve as candidate cell nuclei markers [46]. The procedure is illustrated in Figure 3.1 for a single cervical cell and an annu- lar structuring element. The control over the size and relative isolation of the nuclei is achieved by changing the inner radius and outer radius of the annular structuring element. (a) (b) (c) (d) Figure 3.1: Extracting an inner marker for a free-lying cell nucleus. (a) Original image. (b) Grey-scale erosion with an annular structuring element. (c) Pixel-wise maximum of (a) and (b). (d) Arithmetic difference between (c) and (a). In order to detect a nucleus, an annular structuring element with an inner radius larger than that of the nucleus is needed. The nuclei of normal intermediate and parabasal cells measure approximately 8µm in diameter and may enlarge up to 15µm in the case of malignant or rare benign disorders changes [47]. Hence, to detect all nuclei within this range, a set of independent annular closings with structuring elements with a range of inner diameters is needed. This is then the basis for the more sophisticated inner marker extraction algorithm presented in Algorithm 1. 3.3 Marker-controlled watershed segmentation of the de- tected nucleus-like objects (steps 2-3) Marker-controlled watershed segmentation is used to delineate the boundaries of the cell nuclei detected by the inner marker extraction algorithm. Rather than flooding from the regional minima, as is the case for the traditional watershed transform, flooding is initiated from the markers. The procedure for delineating the detected nuclei-like objects is as follows. First, a watershed segmentation of the original image (f) with respect to the inner markers is performed to obtain the outer marker. Next a watershed segmentation of the gradient magnitude image with respect to the union of the inner , Signals and Systems, Master of Science Thesis 2013 17 CHAPTER 3. NOVEL ALGORITHM FOR SEGMENTING FREE-LYING CELL NUCLEI markers and the outer marker is performed. The resulting watershed lines then delineate Algorithm 1 Proposed nuclei inner marker extraction algorithm Input: Grey-scale image (f), and parameters λ0, α,r1 and r2 Output: Inner markers binary mask (Xm) for free-lying nuclei-like objects. 1: Let B0 be a disk structuring element of radiusλ0 2: for λ = r1 → r2 do 3: Let Ban be an annular structuring element with inner and outer radii of λ and λ+ α respectively. 4: g = Ψanclo(f,B)− f 5: X1 = g > 0 6: X2 = (X1 B0)⊕B0 7: X = X ∪X2 8: end for 9: Xm = set of centroids of the connected components in X. the nuclei-like objects. This idea is illustrated in Figure 3.2. The reason behind using the gradient image is that the nuclei boundaries are located on high gradient points. Therefore the watershed transform results in regions with boundaries corresponding to those of the nuclei. The gradient magnitude image can be quite noisy. For this reason, the original image is median filtered to remove impulse noise (size of the kernel is selected to be 3× 3), its magnitude of gradient is computed, and the result is Gaussian filtered. There exist many different approximations for the computation of the magnitude of the gradient of an image. Several approaches with similar results have been tested. Among all these methods, the very simple and straight forward Gaussian derivative oper- ator is used. The standard deviation of Gaussian filter is selected to be 0.9. The gradient magnitude image is approximated by sum of the absolute values of the convolution of the image with the vertical and horizontal Gaussian derivative operators. 18 , Signals and Systems, Master of Science Thesis 2013 3.3. MARKER-CONTROLLED WATERSHED SEGMENTATION OF THE DETECTED NUCLEUS-LIKE OBJECTS (STEPS 2-3) (a) Watershed segmentation of the original image with respect to the inner markers (shown as disks) yielding the outer marker. (b) Gaussian filtered gradient magnitude of the median-filtered original image in (a). (c) Watershed segmentation of (b) with respect to the union of the inner markers and the outer marker. Figure 3.2: Segmentation of the detected nuclei-like objects. , Signals and Systems, Master of Science Thesis 2013 19 CHAPTER 3. NOVEL ALGORITHM FOR SEGMENTING FREE-LYING CELL NUCLEI 3.4 Artefact Rejection (steps 4-5) The presence of overlapping cells, overlapping and/or folded cytoplasm, blood and cellu- lar debris presents a formidable challenge for nuclei segmentation [21]. Ensuring that the segmented objects are nuclei of cells relevant for MAC analysis is essential. The artefact rejection strategy is based on the size, shape and granularity of the objects. Quantitative measures of the area, elliptical shape, and the texture/granularity are computed for each segmented nucleus-like object. 3.4.1 Size criterion The squamous epithelium of the female genital tract is composed of three principal layers [47]: the basal cell layer (immature), the intermediate cell layers, and the superficial cell layers (most mature) (see Figure 3.3). As the cell maturation progresses toward the surface, the amount of cytoplasm per cell increases [47]. The nuclei of superficial cells are pyknotic and considerably smaller than intermediate and parabasal cells with a nuclear diameter of about 4µm [47]. In a normal Pap smear usually only the upper few layers of the squamous epithelium are removed and so the immature cells near the base of the epithelium are not sampled [16]. Given that the aim of the algorithm is to detect intermediate cell nuclei (presenting a fine network of chromatin and chromocenters suitable for further MAC analysis), a threshold value can be defined to remove the superficial cell nuclei from the segmentation result. The nuclei of normal intermediate cells ranges approximately between 8µm to 15µm in diameter and are oval-shaped [47], accordingly the minimum area can be defined. The minimum area is deemed to be the area of a circle with radius rmin. This artefact rejection step removes not only the superficial squamous cell nuclei but also many of the small objects belonging to image artefacts. 20 , Signals and Systems, Master of Science Thesis 2013 3.4. ARTEFACT REJECTION (STEPS 4-5) Figure 3.3: Cells of the squamous epithelium (freehand adaptation of Koss (2006, Figure 5-4)). 3.4.2 Shape criterion Nuclei of the intermediate cervical cells are elliptical in shape [47]. To assess whether a candidate nucleus object is elliptical, the elliptic variance feature is computed. Some other features we looked at to fulfill the shape criterion are Danielsson’s G shape fac- tor, ellipticity, elliptic variance and Dice-area (see Appendix E). The elliptic variance descriptor (Evar) [48] measures how closely the borders of the fitted ellipse agree with those of the segmented object (see Figure 3.4). The simplest way to generate a signature for coarseness of the boundary of an object is to compute the radial distances of the object boundary from the object centroid. Suppose g = [gx, gy] T is the centroid of the object and object boundary has N data points pi = (xi, yi) T . The covariance matrix of the data points is calculated using: C = 1 N N∑ i=1 (pi − g)(pi − g)T Then the radial distances (d′i) of boundary points from the centroid are calculated as: d′i = √ (pi − g)T ·C · (pi − g) , Signals and Systems, Master of Science Thesis 2013 21 CHAPTER 3. NOVEL ALGORITHM FOR SEGMENTING FREE-LYING CELL NUCLEI Figure 3.4: Ellipse fitting to an arbitrary shape The mean (µ′R) and standard deviation (σ′R) of the radial distances over all boundary points are: µ′R = 1 N N∑ i=1 (d′i) and σ′R = √√√√ 1 N N∑ i=1 (di − µ′R)2 and finally Evar is defined: Evar = σ′R µ′R (3.2) It should be noted that for elliptical objects Evar is close to 0. 3.4.3 Texture criterion The feature chosen to measure the degree of the granularity of the texture in a nucleus is the Tamura coarseness feature. Tamura [49] devised a texture model corresponding to visual perception. It is useful and robust in the sense that it does not depend directly on the exact grey-levels in the object and so has robustness to non-uniformity of illumination and staining variations (provided that these do not greatly affect the size and number of texture primitives). The Tamura coarseness aims to pick the biggest size where the texture is present. The primitive elements (textures) are larger in size but smaller in number for a coarse texture, while a fine texture contains a large number of small primitives. The computational procedure can be summarized in the following steps. 1. Take averages at each pixel (x,y) over neighbourhoods whose sizes are powers of 22 , Signals and Systems, Master of Science Thesis 2013 3.4. ARTEFACT REJECTION (STEPS 4-5) two 2k × 2k. A (x,y) = x+2k−1−1∑ i=x−2k−1 y+xk−1−1∑ j=y−2k−1 f(i,j) 22k (3.3) where f(x,y) is the grey-level at (x,y). The averaging procedure at different levels is depicted in figure 3.5. Figure 3.5: Computing averages at different scales 2. At each pixel, take the absolute difference between pairs of non-overlapping aver- ages on opposite sides of the point in both horizontal and vertical directions. The difference in the horizontal case is: Ek,h(x,y) = ∣∣∣Ak(x+ 2k−1,y)−Ax(x− 2k−1,y) ∣∣∣ (3.4) 3. At each pixel, pick the best size of k which maximizes the difference Ek(x,y) in either direction and set the best size to Sopt(x,y) = 2k. 4. Calculate the coarseness measure by taking the average of Sopt over the entire image: Fcrs = 1 m× n m∑ i n∑ j Sopt(i,j) (3.5) Where, m and n are the width and height of an image, respectively. Figure 3.6 shows 6 images of cell nuclei with their corresponding Tamura coarseness measures. , Signals and Systems, Master of Science Thesis 2013 23 CHAPTER 3. NOVEL ALGORITHM FOR SEGMENTING FREE-LYING CELL NUCLEI a b c d e f Figure 3.6: Measure of Tamura coarseness for 6 different cell nuclei. Cell nuclei a b c d e f Coarseness measures 9.10 9.52 9.56 10.41 10.72 10.9 (a) (b) Figure 3.7: Two examples of artefacts rejected by the Tamura coarseness feature. 24 , Signals and Systems, Master of Science Thesis 2013 3.4. ARTEFACT REJECTION (STEPS 4-5) Figure 3.8: Two sample FOVs from a Pap smear slide. Twelve samples of the segmented objects before the artefact rejection step are highlighted. The quantitative measures of shape and granularity for these objects are in Table 3.2 , Signals and Systems, Master of Science Thesis 2013 25 CHAPTER 3. NOVEL ALGORITHM FOR SEGMENTING FREE-LYING CELL NUCLEI Table 3.2: Shape and granularity features for 6 different cell nuclei. No Cell Nuclei Area Evar fcourse 1 480 0.031 9.48 2 331 0.69 8.79 3 924 0.07 11.95 4 740 0.043 12.23 5 1325 0.033 10.13 6 390 0.104 10.13 26 , Signals and Systems, Master of Science Thesis 2013 3.4. ARTEFACT REJECTION (STEPS 4-5) Table 3.3: Shape and granularity features for 6 non-nuclei artefacts. No Artefacts Area Evar fcourse 7 311 0.1138 10.78 8 460 0.129 11.10 9 392 0.072 9.80 10 336 0.137 9.31 11 288 0.109 12.80 12 464 0.209 11.81 , Signals and Systems, Master of Science Thesis 2013 27 CHAPTER 3. NOVEL ALGORITHM FOR SEGMENTING FREE-LYING CELL NUCLEI 3.5 Empirical evaluation of the proposed segmentation al- gorithm It was noted in the review in section 2.1 that many of the proposed segmentation al- gorithms have not been formally empirically evaluated. Herein we present an empirical evaluation of our proposed method relative to expert manual segmentation. 3.5.1 Image data The data used in this study is a subset of 889 fields of view (FOVs) captured by a cytopathologist from 68 Pap smear slides. Each FOV was acquired using a CCD camera mounted on a light microscope. The images were captured with a 40× objective lens. Each FOV image is of size 1024 × 1344 pixels with square pixels of size 0.25µm. The grey-scale resolution is 8 bits per pixel. 3.5.1.1 Ground truth generation Eleven slides, each containing a minimum of 100 non-superficial cervical cell nuclei, were randomly selected from among the 68 slides. For each slide three FOVs were randomly selected to yield a total of 33 FOVs. Two graphical user interfaces (GUI) were developed to permit a user to review each FOV and to place a marker on individual nuclei and also trace the boundary of nu- clei. These two GUIs were designed specifically for the two purposes of marking and manual border delineation of cell nuclei (see Figures 3.9 and 3.10). They provide the following functions: cell nuclei marking function (by clicking within the nucleus area), border delineation function (using the mouse to trace the cell nuclei boundaries) and a measurement tool for checking the desired particles within the image. Figures 3.9 and 3.9 show two GUIs designed and used for ground truth generation. Three untrained subjects were recruited to independently review the FOVs using the GUI and to mark each free-lying nucleus. Prior to performing this task each was shown examples of intermediate cell nuclei in another FOV (not one of the 33 FOVs they had to review). Each subject was specifically instructed to mark elliptical objects, of approximately the right size, with a well-defined boundary, and with a granular texture. The set of all objects selected by at least two of the three subjects were taken to be the ground truth for free-lying intermediate cell nuclei. Two image analysis experts (authors) then used the GUI to independently trace the boundary of each ground truth nucleus. These manual segmentations were taken to be the ground truth for the boundaries of the free-lying intermediate cell nuclei. 28 , Signals and Systems, Master of Science Thesis 2013 3.5. EMPIRICAL EVALUATION OF THE PROPOSED SEGMENTATION ALGORITHM F ig u re 3 .9 : G U I fo r G ro u n d tr u th g en er a ti o n , Signals and Systems, Master of Science Thesis 2013 29 CHAPTER 3. NOVEL ALGORITHM FOR SEGMENTING FREE-LYING CELL NUCLEI F ig u re 3 .1 0 : G U I fo r G ro u n d tru th g en era tio n 30 , Signals and Systems, Master of Science Thesis 2013 3.5. EMPIRICAL EVALUATION OF THE PROPOSED SEGMENTATION ALGORITHM 3.5.2 Method and Experiment The proposed segmentation algorithm was applied to the 33 FOV images. The param- eters for different steps of the algorithm were selected after several experiments on a small subset of images independent of the 33 selected FOV images. The minimum and maximum values of the inner radius (r1 and r2) of the annular structuring elements (see algorithm 1) were set to 22 and 33 pixels. The values were selected based on the nucleus diameter range for the intermediate cell nuclei defined by Koss [47]. The outer radius of each annulus was set to be two pixels more than the inner radius (i.e. α = 2) to guarantee the extraction of inner markers of the adjacent free-lying nuclei. Finally the size of the disk-structuring element (λ0) was set to 3 pixels (for noise removal). The parameters for the artefact rejection step were also tuned on an independent data set. Based on the defined size range of intermediate cell nuclei in [47], the threshold value of 400 for segmented regions was selected for artefact rejection (The minimum area is deemed to be the area of a circle with radius rmin). The threshold value of 0.095 and 13.2 were selected, respectively, for the elliptic variance and Tamura coarseness features. Objects selected by the algorithm were compared to the ground truth nuclei obtained manually and used to compute the sensitivity and specificity of the algorithm for the detection of free-lying intermediate cell nuclei. The accuracy of segmentation by the proposed algorithm for each detected mask was assessed through comparison of the boundaries of the segmented object to the two cor- responding boundaries (boundaries traced by two experts) in the ground truth datasets. The similarity between pairs of masks and the similarity is computed. More specifi- cally the similarity between pairs of masks was computed in terms of the Dice similarity coefficient (DSC) scores defined by equation E.3. 3.5.3 Results The sensitivity and specificity of the algorithm for the detection of free-lying interme- diate cell nuclei is 94.71% and 85.3% respectively. Box-plots of the DSC scores for the comparison of the proposed automatic segmentation to the two manual segmentations, and for the comparison between the two manual segmentations are shown in Figure 3.11. , Signals and Systems, Master of Science Thesis 2013 31 CHAPTER 3. NOVEL ALGORITHM FOR SEGMENTING FREE-LYING CELL NUCLEI Figure 3.11: DSC scores for the comparison of the proposed automatic segmentation to the two manual segmentations, and for the comparison between the two expert segmentations (GT1 and GT2). The agreement between the algorithm and the two manual segmentations is 97.30± 1.35% and 96.96 ± 1.7% respectively (mean DSC±standard deviation). The overall agreement between the two expert segmentations is 97.26± 1.2% . 3.5.4 Discussion The sensitivity of the algorithm to the detection of free-lying intermediate cell nuclei is remarkably high. However its specificity, whilst still quite high, could be improved. A review of false positives indicates that some of them are due to segmentation failures as the result of severe background noise and artefacts. However other apparent failures in fact represent genuine free-lying nuclei overlooked by the three recruited subjects. Understanding the reasons for an algorithm’s failure modes can be used to improve its robustness. The main failure modes responsible for detection of false positive cases are discussed bellow. The DSC scores for boundary delineation evaluation (see Figure 3.11) show that nuclei boundaries obtained using the marker-controlled watershed transform are highly accurate and consistent with the two experts’ visual perception of the intermediate cell nuclei boundaries. 3.5.4.1 Failure modes In the experiment 111 false positive cases were detected. In many of the cases the false positives are visually similar in appearance to a real nucleus. Therefore the marker 32 , Signals and Systems, Master of Science Thesis 2013 3.5. EMPIRICAL EVALUATION OF THE PROPOSED SEGMENTATION ALGORITHM detection algorithm extracts them as a nucleus-like object. The aim of the artefact re- jection step is to discard a segmented artefact. The ellipticity and size features effectively eliminate any object with an irregular size or shape. After reviewing the false positive we discovered we concluded that the Tamura coarseness feature, whilst quite effective in most cases, did not reject some of the artefacts as expected. Figure 3.12 shows two such cases. The results suggest that other features be considered in future to improve the specificity. (a) (b) Figure 3.12: (a) and (b) show two examples of false positive detection by the algorithm. There were very few cases where there was no marker for a nucleus. Among the total 755 nuclei present in our database, the algorithm did not produce any marker for 14 nuclei. This basically happens when the nucleus is surrounded by many other objects, especially when the adjacent objects have considerably lower intensity compared to the the nucleus. An example of this type of failure is depicted in Figure 3.13a. All the other nuclei markers were successfully extracted and no multiple seeds for a single object were detected. The other mode of failure was due to inaccurate border delineation by the watershed algorithm. This happens when there is very weak gradient information on the boundary of the nucleus (see Figures 3.13b and 3.13c). A total of 6 nuclei were missed due to watershed failure. Finally the particular thresholds defined for artifact rejection features were responsible for eliminating some of the true segmented free-lying nuclei. First of all the size feature has caused the rejection of the some of the segmented nuclei that have slightly lower size than the predefined threshold. The ellipticity and the Tamura coarseness features were responsible for discarding the rest of the true segmented nuclei. Overall, 20 segmented regions that were truly segmented were rejected by the artefact rejection features. Figure 3.14 shows some examples of these failure modes. , Signals and Systems, Master of Science Thesis 2013 33 CHAPTER 3. NOVEL ALGORITHM FOR SEGMENTING FREE-LYING CELL NUCLEI (a) (b) (c) Figure 3.13: Examples of missed nuclei due to presence of background noise surrounding the nucleus and watershed transform failure: (a) The marker extraction algorithm has failed to provide a marker for the nuclei. (b) Example of failure of the morphological watershed technique because of sharp edge information inside the object, as seen in the gradient image (c) (a) (b) (c) (d) Figure 3.14: Examples of missed nuclei due to unsuitable choice of parameter values: (a) and (b) The two objects have approximately the same size, however, the segmented nucleus has a size slightly above the threshold value while the size of the other nucleus is slightly below, hence discarded as an artefact. (c) and (d) Example of an accurately segmented object which was discarded as an artefact in the artefact rejection step. This is because the value for Tamura coarseness feature of this nucleus is lower than the defined threshold value. 3.6 Summary This chapter presented a new algorithm for segmentation of free-lying cell nuclei in Pap smears. The novelty of the algorithm stems from a robust marker selection method for selecting candidate free-lying nuclei-like objects for subsequent marker-controlled water- shed segmentation to obtain the nuclear boundaries. The algorithm also implements artefact rejection based on size, shape, and nuclear granularity to ensure only the nuclei of intermediate squamous epithelial cells are retained. In addition, this chapter presented an empirical evaluation of the performance of the algorithm and discussed the results. The sensitivity and specificity of the algorithm to the detection of free-lying intermediate cell nuclei together with the accuracy of the segmentation of each nucleus detected by the algorithm were presented. 34 , Signals and Systems, Master of Science Thesis 2013 4 Texture features characterising chromatin distribution T his chapter is concerned with the quantitative characterisation of the texture of chromatin. A review of chromatin texture features proposed to date was presented in chapter 2. A novel set of structural chromatin texture features are presented in this chapter. These are derived from a segmentation of the chromatin into blob- like primitives. By considering each of the chromatin particles as individual regions (objects), many features can be extracted that describe the distribution and arrangement of those particles. This is the structural way of describing texture where the texture is characterised by a description of its primitives and placement rules [20]. Using the segmented chromatin particles, two sets of features can be measured: Fea- tures that solely characterise the blob-like particles in terms of measurements such as area, perimeter, dynamics, etc. and features that characterise the spatial relationships existing between the chromatin particles. The remainder of this chapter is organised as follows. Section 4.1 presents the utilised chromatin segmentation algorithm. Section 4.2 presents the proposed structural texture features to characterise chromatin texture/distribution. Finally, Section 4.3 summarises the chapter. 4.1 Chromatin segmentation Chromatin is the material in the cell nucleus consisting of DNA and associated proteins. It is so-named because of its ability to take on stain. In a Pap smear nuclear chro- matin is visualised under light microscope as a mosaic of interchanging light and dark regions within the nucleus (dark and light regions with high and low optical density1, 1The optical density also called extinction of a material is the logarithmic ratio of the incident radiation to the transmitted radiation through that material. 35 CHAPTER 4. TEXTURE FEATURES CHARACTERISING CHROMATIN DISTRIBUTION respectively). Regions of high optical density are well defined as particles [50]. The method chosen to segment the chromatin is proposed by Mehnert [20]. It can be used to segment the dark and light particles (blobs) in a grey-level image of nucleus. The grey levels in the image can represent either the intensity or optical density. Figure 4.1 shows a representation of a nucleus as a topographic relief. The dark particles or chromatin blobs correspond to valleys and light particles correspond to mountains. In a typical topographic relief, the valleys are associated with minima and the mountains are associated with maxima. Regional Maxima Regional Minima Figure 4.1: Representation of a cervical cell nucleus as a topographic relief (grey-level represents height). The chromatin segmentation algorithm proposed by Mehnert [20] has the following steps: 1. Locating the regional minima; • Applying a 3x3 median filter (to attenuate impulse noise); • Up-sampling by a factor of 3 (to facilitate the rendering of watershed lines in a subsequent step); • Locating the regional minima (inner markers). 2. Computing the watershed transform of the filtered image with respect to the inner markers (to produce an outer marker that delineates a zone of influence around each regional minimum); and 3. Computing the magnitude of gradient image for the filtered image; and 4. Applying the watershed transform to the gradient image with respect to both the inner markers and the outer marker (to delineate each dark chromatin blob). Figure 4.2 shows examples of the segmentations produced by this algorithm. Light particles can also be obtained by applying the algorithm to the negative of the image. 36 , Signals and Systems, Master of Science Thesis 2013 4.2. PROPOSED STRUCTURAL TEXTURE FEATURES (a) (b) (c) (d) Figure 4.2: Chromatin segmentation. (a) Original image of cell nucleus. (b) Watershed transform of (a) with respect to regional minima (inner markers) yielding the outer marker. (c) The morphological gradient image of (a). (d) Watershed segmentation of (c) with respect to the union of the inner markers and the outer marker. 4.2 Proposed structural texture features In this section we present a novel set of structural chromatin texture features derived from a segmentation of the chromatin. By considering each of the chromatin particles as individual regions (objects), many features can be extracted that describe the distribu- tion and arrangement of those particles. Using the segmented chromatin particles, two sets of features can be measured: features that characterise the spatial relations exist- ing between the chromatin particles and features that solely characterise the blob-like particles through measurements such as area, perimeter, dynamics, etc. 4.2.1 Graph-based features The graph-based features are derived from ordinary delaunay graph (the dual of the or- dinary Voronoi diagram), several graphs related to the area-Voronoi diagram–generalised Delaunay graph, generalised Gabriel graph, generalised relative neighbourhood graph, and the generalised minimum spanning tree–and from the area-Voronoi diagram itself. These graphs describe different adjacency relationships between the chromatin particles. Formally, a graph is represented by G = (V,E) where V is called the vertex set and E is the edge set of the graph. Vertices correspond to particles and edges denote adjacency (see [51] for details concerning the computation of the graphs). Our proposed graph-based features can be categorised into 6 groups: Delaunay graph based features, features based on the area-Voronoi diagram, features based on the gen- eralised Delaunay graph drived from the area-Voronoi diagram, generalised Gabriel and relative neighbourhood graph based features, and finally features based on the gen- eralised minimum spanning tree. Definitions and basic geometric properties of these graphs can be found in [51]. Figure 4.3 shows the area-Voronoi diagram constructed by computing the watershed transform of the Euclidian distance transform of the seg- mented chromatin particle masks and the generalised Delaunay graph derived from the area-Voronoi diagram. Figure 4.4 shows the graphs drived from the area-Voronoi diagram. The generalised Delaunay graph (DG) is obtained directly from the area-Voronoi diagram. Generalised , Signals and Systems, Master of Science Thesis 2013 37 CHAPTER 4. TEXTURE FEATURES CHARACTERISING CHROMATIN DISTRIBUTION (a) (b) (c) (d) (e) Figure 4.3: Illustration of how the area-Voronoi diagram and generalised Delaunay graph are computed. (a) Segmented chromatin particles superimposed on the original nucleus image. (b) Euclidian distance transform of the set complement of the union of the chromatin particle masks in (a). (c) Area-Voronoi diagram. (d) The area-Voronoi diagram clipped to the nuclear mask. (e) The region adjacency graph defined on the Voronoi regions then defines a generalised Delaunay graph. Gabriel graph (GG), generalised relative neighbourhood graph (RNG), and generalised minimum spanning tree (MST ) are all obtained as sub-graphs of the generalised Delau- nay graph. In particular the generalised Delaunay graph is treated as an ordinary De- launay graph, with vertices now corresponding to particle centroids, and the sub-graphs are obtained by removing select edges. This means that they each have the same vertices and the edges satisfy MST ⊆ RNG ⊆ GG ⊆ DG. Several features can be computed from the area-Voronoi diagram and associated graphs to quantify the arrangement of the chromatin particles inside the nucleus. Features extracted from the Delaunay, Gabriel and relative neighbourhood graphs measure the distances between chromatin particles and describe variations in the connectivity of the nodes. Voronoi-based features describe the shape and size of the Voronoi polygons/regions. The minimum spanning tree con- nects all the vertices in a given graph in a way that the sum of the edge lengths are the minimum possible. (d )(c)(b )(a) Figure 4.4: Graphs derived from the area-Voronoi diagram. (a) the generalised Delaunay graph (b) the generalised Gabriel graph (c) the generalised relative neighbourhood graph (d) the generalised minimum spanning tree. See the text for details. 38 , Signals and Systems, Master of Science Thesis 2013 4.2. PROPOSED STRUCTURAL TEXTURE FEATURES 4.2.2 Margination features Our margination features characterise the distances of the segmented chromatin blobs to the nucleus boundary. Six of these features are computed from the cookie-cutting- distance [20] illustrated in Figure 4.5. The slopes of the cumulative frequency curve of the cookie-cutting-distance at the 25th and 37th percentiles are calculated (M1 and M2) as two measures of margination for each nucleus1. Two other features for characterising margination are derived by computing the mean and standard deviation of the distances in the cookie-cutting-distance image (M3 and M4). These two features also include information about the size of the nucleus, which is an indicator of the state of cell in certain disease processes [52]. In addition, features M3 and M4 are both normalized to the maximum value of the nucleus boundary distance transform to yield pure measures of margination (M5 and M6). One additional feature is the mean of the minimum distances between the geometrical centres of chromatin particles and the nucleus boundary (M7). 0 5 10 15 20 25 30 35 0 500 1000 1500 C u m m u la ti ve H is to g ra m C u m m u la ti ve H is to g ra m Distance from boundary Distance from boundary (a) (c) (g) (b) (f) Slope = 35.05 (d) 5 10 15 20 25 30 35 40 0 200 400 600 800 1000 1200 1400 (e) (h) Slope = 37.74 Figure 4.5: Illustration of how the margination features M1 and M2 are computed. (a),(e) Segmented chromatin particles. (b),(f) Euclidean distance transform of the nucleus boundary. (c),(g) Portions of (b),(f) cut out by the segmented chromatin particle masks. This is called cookie-cutting-distance. (d),(h) Slope of the cumulative frequency curve of the cookie-cutting-distance at the 37th percentile. 4.2.3 Clustering features Four quantitative measures for chromatin particle clustering are proposed. For the first measure, the distance transform of the area-Voronoi diagram circumscribed by nucleus 1These percentiles were chosen empirically using the cell images that were not used in the empirical evaluation described in the next chapter. , Signals and Systems, Master of Science Thesis 2013 39 CHAPTER 4. TEXTURE FEATURES CHARACTERISING CHROMATIN DISTRIBUTION boundary is calculated. Portions of the resultant image are cut out by the segmented blob masks. Afterwards, the cumulative frequency histogram of this image is constructed and then the slopes of the cumulative frequency curve at the 25th and 37th percentile2 are calculated to give the first two measures C1 and C2. Features C3 and C4 are computed from the area-Voronoi diagram and the distance transform image used in its construction (Figure 4.7a and 4.7b) as follows. Firstly for every pair of Voronoi regions that share a common edge, the convex hull is computed for the union of the corresponding pairs of particles. Secondly the union of these convex hulls is filled to form a binary mask (Figure 4.7d). Thirdly this mask is intersected with the divide lines of the area-Voronoi diagram (Figure 4.7e). Finally features C3 and C4 are the mean and standard deviation, respectively, of the distance transform values corresponding to the residual divide line pixels in this intersection (Figure 4.7f). 1 2 3 4 5 6 7 8 9 10 11 12 13 14 0 100 200 300 400 500 600 700 800 900 1000 Slope = 131.90 C u m m u la ti ve H is to g ra m C u m m u la ti ve H is to g ra m Distance from boundary Distance from boundary (g)(f)(e) (h) 1 2 3 4 5 6 7 8 9 10 11 12 0 200 400 600 800 1000 1200 Slope = 191.39 (a) (c)(b) (d) Figure 4.6: Illustration of how the features C1 and C2 are computed. (a),(e) Area- Voronoi diagram and segmented chromatin particles. (b),(f) Distance transform of the Voronoi regions. (c),(g) Portions of (b),(f) cut out by the segmented chromatin particle masks. (d),(h) Slope of the cumulative frequency curve of the distances in (c),(g) at the 37th percentile. 2See footnote 1. 40 , Signals and Systems, Master of Science Thesis 2013 4.2. PROPOSED STRUCTURAL TEXTURE FEATURES (a) (b) (c) (f)(e)(d) −2 0 2 4 6 8 10 12 14 16 0 5 10 15 20 25 30 35 40 45 50 Figure 4.7: Illustration of how the features C3 and C4 are computed. (a) Segmented chromatin particles. (b) Distance transform of the set complement of the union of the chromatin particle masks. (c) Area-Voronoi diagram. (d) Binary mask (with particles superimposed) formed from the union of the convex hulls (see the text for details). (e) Voronoi region divide lines clipped according to the mask in (d). (f) Frequency histogram of the distance transform values corresponding to the residual divide line pixels in (e). 4.2.4 Contextual chromatin particle features Contextual features provide information based on object-object or object-scene interac- tions. The information related to the segmented chromatin blobs can be integrated with the higher level contextual information. For example, the area ratio of each chromatin particle to its watershed ZOI captures information about object area as well as back- ground contextual information. Table 4.1 lists the contextual chromatin particle features measured in this study. 4.2.5 Statistics of chromatin particle morphometric features Morphometric features describe the geometry (shape, size, position and boundary) of the chromatin blob and are computed from its binary mask. By computing the mean and standard deviation of the extracted features for all the blobs inside a nucleus, a structural texture measure for the nucleus is obtained. , Signals and Systems, Master of Science Thesis 2013 41 CHAPTER 4. TEXTURE FEATURES CHARACTERISING CHROMATIN DISTRIBUTION Table 4.1: Summary of the extracted nucleus features Morphometric Features Area, perimeter, mean-radius, elliptic variance [53] Densitometric Features Photometric features [54]: Integrated optical density (IOD), mean optical density (MOD), variance of optical density, skewness of optical density, kurtosis of optical density Texture Features S ta ti st ic a l Fractal texture features [42]: Fractal area 1, fractal area 2, fractal dimension Run length features [55]: SRE, LRE, GLN, RLN, RP, LGRE, HGRE, SRLGE, SRHGE, LRLGE, LRHGE Histogram based features [42]: Mean, standard deviation, and skewness of the grey-level histogram GLCM features [54]: Contrast, correlation, energy, entropy, local homogeneity, maximum probability, cluster shade, cluster prominence and information measure of correlation (H-IMC1, H-IMC2) Complex Daubechies wavelet features of nuclei [56]: Mean, standard deviation, and skewness of the grey- level histogram. GLCM features – contrast, correlation, energy, entropy, local homogeneity, maximum probability, cluster shade, cluster prominence and information measure of correlation (W -IMC1, W -IMC2) Statistics of the chromatin particle morphometric features Mean of chromatin particle areas and perimeters, mean of chromatin particle areas normalized to the nucleus area, mean of chromatin particle perimeters normalized to the nucleus perimeter, chromatin particle compactness (P2A) Contextual chromatin particle features S tr u c tu ra l Margination: M1, M2, M3, M4, M5, M6, M7 (See section 4.2.2) Clustering: C1, C2, C3, C4 (See section 4.2.3) Blob features: Heterogeneity (the ratio between the area of the segmented dark and light regions, and the nucleus area) [52], the ratio between the area of the segmented dark regions and the nucleus area, number of segmented dark particles, number of segmented light particles, area ratio of each chromatin particle to its watershed ZOI, average distance between the geometrical center of the nucleus and pixels of all chromatin particles, distance between the geometrical center of the nucleus and center of mass of the chromatin particles Discrete texture features [42, 57]: Medium and high DNA amount, medium and high DNA area, medium and high DNA compactness, medium-high DNA compactness, center of gravity (the distance from the geometrical center of the blob to the center of gravity of the optical density function) Area-Voronoi diagram: Mean and standard deviation (SD) of the areas of the Voronoi regions, mean and SD of the areas of the Voronoi regions normalized to the nucleus area, area disorder [55], Voronoi regions roundness [55], mean of area ratio of each chromatin particle to its Voronoi region [55] Delaunay graph and generalised Delaunay graph: Mean of the Delaunay triangle areas, mean and SD of the Delaunay edge lengths, average of the mean and SD of the edge lengths connected to each vertex (chromatin particle), maximum Delaunay edge length, mean of the number of connections per chromatin particle, number of vertices (chromatin particles) on the graph boundary Gabriel graph and relative neighbourhood graph: Mean and SD of the graph edge lengths, maximum edge length, average of the mean and SD of the graph edge lengths connected to each vertex (chromatin blob), mean of the number of connections per vertex (chromatin blob) Minimum spanning tree: Mean and SD of the edge lengths, total edge length, edge disorder [55], minimum to maximum edge ratio, mean of the number of connections per vertex, percentage of vertices connected to one vertex (MST1), percentage of vertices connected to two vertices (MST2), percentage of vertices connected to more than two vertices (MST3) 42 , Signals and Systems, Master of Science Thesis 2013 4.3. SUMMARY 4.3 Summary The chapter presented a set of novel structural texture features for quantifying nuclear chromatin patterns in cells on a conventional Pap smear. The features are derived from an initial segmentation of the chromatin into blob-like texture primitives. Several new features were introduced to quantify the qualitative description of chromatin used by cytoprofessionals, such as margination and clustering, using an structural approach to texture analysis requiring an initial segmentation of chromatin. , Signals and Systems, Master of Science Thesis 2013 43 5 Evaluation of the proposed texture features for screening Pap smear slides T his chapter presents an empirical evaluation of the performance of the structural chromatin texture features proposed in the previous chapter. In particular two experiments, using clinical Pap smears sourced from the department of pathol- ogy, Regional Cancer Centre (RCC), Thiruvananthapuram in India, are detailed. The first is a comprehensive feature selection experiment that sought to determine the most discriminatory subset of features, from among the proposed features and features drawn from the literature, for discriminating between normal and abnormal smears. The sec- ond is a classification experiment that sought to evaluate the performance of a variety of classifiers, built using the feature sets obtained in the first experiment, for discriminating between normal and abnormal slides on the basis of only normal-appearing cells. The remainder of this chapter is organized as follows. Section 5.1 presents the image data used in the experiments. Section 5.2 details the feature selection methods considered in this study. Section 5.3 details the first experiment carried out to obtain the most discriminatory subset of features. Section 5.4 then details the second experiment. 5.1 Pap smear images and ground truth The image data used in this study originate from a set of 68 conventional Pap smear slides sourced from the Regional Cancer Centre (RCC), Thiruvananthapuram in India. Each slide was reviewed by a cytopathologist and assigned a cytological diagnosis according to the Bethesda system [58]. The cytopathologist subsequently acquired representative FOVs from each slide. In the case of abnormal smears this included FOVs with and without diagnostic cells. The cytopathologist also labelled individual cells in each FOV 45 CHAPTER 5. EVALUATION OF THE PROPOSED TEXTURE FEATURES FOR SCREENING PAP SMEAR SLIDES accordingly. The FOV images were acquired using a monochrome CCD camera mounted on a light microscope with a 40× objective lens with a numerical aperture of 0.95. Each image has a grey-scale resolution of 8 bits per pixel and is of size 1024×1344 pixels with square pixels of size 0.25µm. 5.2 Feature selection methods considered in this study 5.2.1 The curse of dimensionality and the need for feature selection The first step in designing a classifier is to have a dataset for training and evaluating the performance of the classifier. It is unequivocal that this data must contain represen- tative observations from each class, and that the sample size determines the number of features required to build a classifier to discriminate between different classes [59]. As illustrated earlier in the chapter 4, a wide range of features can be computed and those can be used for building the classifier. However, increasing the number of features to build a classifier tends to increase the misclassification error. In addition, the prediction variability increases and the classifier becomes very sensitive to the outliers. Hence, there would be no guarantee for the designed classifier to have roughly the same classification performance on a new data set [59]. It is important to point out that building a classifier with too many features on a too small training dataset can lead to a “perfect” classification performance on the training dataset, but very poor performance on unseen test data. Feature selection is a dimensionality reduction process, which aims to select an opti- mum subset of features from the original potentially discriminating set of features. There are two main reasons for using feature selection. First of all, irrelevant and redundant features are detrimental for machine learning algorithms. For a given fixed number of training samples, the predictive power of the classifier decreases as the dimensionality increases. Feature selection techniques are intended to avoid the curse of dimensionality by removing irrelevant and redundant features. Secondly, feature selection techniques greatly reduce the measurement and computational cost of classifying high-dimensional data [59]. The appropriate selection of relevant feature subset makes it possible to achieve excellent performance on classification of malignancy. In this study, the following three feature selection methods were investigated: the state-of-the-art multiple support vector machines with recursive feature elimination (MSVM-RFE) [60], L1-regularization path for generalised linear models [61], and the guided regularized random forest (GRRF) [62]. 5.2.2 MSVM-RFE The feature extraction technique utilising Support Vector Machine based on recursive feature elimination (RFE) was first proposed by Guyon [63]. It returns a ranking of the features of a classification problem by training a SVM with a linear kernel and removing the feature with smallest ranking score. 46 , Signals and Systems, Master of Science Thesis 2013 5.2. FEATURE SELECTION METHODS CONSIDERED IN THIS STUDY The criterion used by the SVM-RFE is the weight magnitude and it is related to each feature’s support to the discrimination function. The recursive feature elimination algorithm, recursively identifies the features with the smallest weights in magnitude and removes them. Multiple SVM-RFE [60] is another feature selection method that uses backward feature elimination procedure similar to SVM-RFE. However, instead of using a single linear SVM at each recursive step, multiple linear SVMs are trained on sub-samples of the training data obtained from multiple runs of k-fold cross validation. The feature ranking score is then computed from statistical analysis of the weight vectors of the multiple linear SVMs. Similar to the SVM-RFE method, the feature with the smallest ranking score is omitted at each step. 5.2.3 Guided Regularized Random Forest (GRRF) Decision trees are increasingly used for the purpose of the feature selection. Bagged random trees known as Random forest (RF) have also been widely utilised for measuring the feature importance. Those feature importance scores can be used in the feature selection process to shortlist the most discriminating features and taking into account only the features with high importance scores. The feature shortlisting procedure consists of multiple iterations. The feature with the lowest importance score will be eliminated from feature set per iteration. The number of features to be eliminated per iteration can be increased to make the algorithm more time and cost efficient (the authors implemented the algorithm to eliminate 20% of the features per iteration) [64]. The main drawback of the ordinary random forest is that at every node, a feature that optimizes an information-theoric criterion will be selected to split the data. However, the redundancy of that feature to the features selected in the previous splits is not considered in this procedure. A regularization framework can address the issue and can avoid selecting a new feature for data splitting at each node when that feature produces a similar gain (im- portance scores) to the feature set that has been selected in the previous splits. The RF gives an importance score for each feature, though it does not end up in a feature subset. Hence, the regularization framework can be easily added to the RF to ease the feature subset selection. This implementation of the regularization framework on ran- dom forest known as regularized random forest (RRF) [62] has been recently proposed for the feature selection purposes. Nevertheless, evaluations of the features are usually done by a portion of the training data at each tree node, which might cause in-stability in feature selection procedure. To overcome this problem, an enhanced version of the RRF, referred to as guided RRF has been proposed by Deng [62]. In this method, first the importance scores for each variable from an ordinary random forest is computed. Then those scores are used to guide the feature selection (in this case RRF). Given that the importance scores in ordinary RF are computed on all of the trees and also on all of the training data, the GRRF is likely to perform more efficiently in comparison to RRF in terms of feature selection performance. , Signals and Systems, Master of Science Thesis 2013 47 CHAPTER 5. EVALUATION OF THE PROPOSED TEXTURE FEATURES FOR SCREENING PAP SMEAR SLIDES 5.2.4 L1-regularization procedure with generalised linear model Generalised linear models (GLM) generalise the classical linear models based on the normal distribution. In an ordinary linear model the relation between the response variable Yi and the independent predictor variables Xij is given by: ηi = Yi = β0 + β1Xi1 + β2Xi2 + . . .+ βkXij (5.1) where η is the linear predictor and βs are the coefficients of the linear combination whose values are unknown and have to be estimated. The coefficient βj is the measure of the impact of the predictor Xij on the target Yi. In an ordinary linear model, for each set of values for the predictors, the response variable is continuous and normally distributed [65]. In a generalised linear model, by contrast, response variables belong to the exponen- tial family of distributions, such as the Gaussian, Binomial, Poisson, Gamma, or inverse Gaussian. In addition, the relation between the response and predictor variables does not necessarily have the simple linear form as in equation 5.1. In fact, the independent predictive variables can take non-linear transformations of the other predictor variables. However, the linear predictor is still linear in the coefficients (parameters) that need to be estimated. GLMs model the random variable Y by providing a relationship between its expected value and the linear predictor η. ηi = g (µ) = β0 + β1Xi1 + β2Xi2 + . . .+ βkXij (5.2) where g is called the link function. The link function relates the linear predictor η to the expected value of the response variable. The link function for a classical linear model is the identity function g (µ) = µ = E (Y ). Examples of link fun