Co-occurrence network

A co-occurrence network created with KH Coder

A co-occurrence network, sometimes referred to as a semantic network,[1] is a method of analyzing text that includes a graphic visualization of potential relationships between people, organizations, concepts, biological organisms such as bacteria,[2] or other entities represented within written material. The generation and visualization of co-occurrence networks have become practical with the advent of electronically stored text amenable to text mining.

By definition, a co-occurrence network is the collective interconnection of terms based on their paired presence within a specified unit of text. Networks are generated by connecting pairs of terms using a set of criteria defining co-occurrence. For example, terms A and B may be said to “co-occur” if they both appear in a particular article. Another article may contain terms B and C. Linking A to B and B to C creates a co-occurrence network of these three terms. Rules to define co-occurrence within a text corpus can be set according to desired criteria. For example, a more stringent criterion for co-occurrence may require a pair of terms to appear in the same sentence. Co-occurrence networks have been found particularly useful for analyzing large text collections and big data, for example when identifying the main themes and topics (such as in a large number of social media posts), revealing biases in the text (such as biases in news coverage), or even mapping an entire research field.[3]
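The A, B, C example can be sketched in a few lines of Python. This is only an illustration: the function name `cooccurrence_edges` and the representation of each text unit as a list of terms are assumptions made here, not part of any particular tool.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(units, vocabulary):
    """Count how often each pair of vocabulary terms shares a text unit."""
    edges = Counter()
    for unit in units:
        # Terms of interest that are present in this unit, in a stable order.
        present = sorted({term for term in vocabulary if term in unit})
        # Every unordered pair of present terms co-occurs once in this unit.
        edges.update(combinations(present, 2))
    return edges

# Two hypothetical articles: one mentions A and B, the other B and C.
articles = [["A", "B"], ["B", "C"]]
edges = cooccurrence_edges(articles, {"A", "B", "C"})
print(dict(edges))  # {('A', 'B'): 1, ('B', 'C'): 1}
```

A and C are not linked directly, but both connect to B. Passing sentences instead of whole articles as `units` applies the more stringent same-sentence criterion without changing the function.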

Methods and development

The process of constructing co-occurrence networks includes identifying keywords in the text, calculating the frequencies of co-occurrences, and analyzing the networks to find central words and clusters of themes in the network.[4]
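A minimal sketch of the first steps of this process is given below: keyword identification, co-occurrence counting, and a simple centrality score. The stopword list, the three-sentence corpus, and the use of weighted degree as the centrality measure are all assumptions chosen for illustration; cluster detection is omitted.

```python
import re
from collections import Counter
from itertools import combinations

# Illustrative stopword list, not a standard one.
STOPWORDS = {"the", "is", "of", "or", "a", "in", "at"}

def keywords(sentence):
    """Lowercase the sentence, split it into words, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOPWORDS]

def build_network(sentences):
    """Count keyword pairs that appear together in the same sentence."""
    pair_counts = Counter()
    for sentence in sentences:
        terms = sorted(set(keywords(sentence)))
        pair_counts.update(combinations(terms, 2))
    return pair_counts

def weighted_degree(pair_counts):
    """Sum the co-occurrence counts on every edge touching each term."""
    degree = Counter()
    for (a, b), n in pair_counts.items():
        degree[a] += n
        degree[b] += n
    return degree

# Hypothetical corpus, for illustration only.
corpus = [
    "The dawn is the appearance of light",
    "Light is golden at dawn",
    "Sunrise follows the dawn",
]
network = build_network(corpus)
centrality = weighted_degree(network)
print(centrality.most_common(1))  # [('dawn', 6)]
```

Weighted degree is only one possible notion of a "central" word; other measures such as betweenness could be substituted without changing the counting steps.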

Word co-occurrence network (range 3 words) for the following sentence: "The dawn is the appearance of light - usually golden, pink or purple - before sunrise"
Co-occurrence network of a bacterial community in a stream[5]
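A word co-occurrence network of the kind captioned above for the "dawn" sentence can be approximated with a sliding window. The sketch assumes that a "range" of 3 words means each word is linked to the next two words in the sentence; the figure may use a different convention, and the function name `window_cooccurrence` is hypothetical.

```python
import re
from collections import Counter

def window_cooccurrence(text, window=3):
    """Link each word to the words that follow it within `window` tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pairs = Counter()
    for i, left in enumerate(tokens):
        # With window=3, pair the word with the next two tokens.
        for right in tokens[i + 1 : i + window]:
            if left != right:  # skip self-links for repeated words
                pairs[tuple(sorted((left, right)))] += 1
    return pairs

sentence = ("The dawn is the appearance of light - usually golden, "
            "pink or purple - before sunrise")
pairs = window_cooccurrence(sentence)
print(len(pairs))               # 23 distinct edges
print(pairs[("dawn", "the")])   # 2, because "the" occurs on both sides of "dawn"
```

No stopwords are removed here, so function words such as "the" and "or" appear as nodes.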

Co-occurrence networks can be created for any given list of terms (any dictionary) in relation to any collection of texts (any text corpus). Co-occurring pairs of terms can be called “neighbors” and these often group into “neighborhoods” based on their interconnections. Individual terms may have several neighbors. Neighborhoods may connect to one another through at least one individual term or may remain unconnected.
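The notions of neighbors and neighborhoods can be made concrete with an adjacency map. In the sketch below a neighborhood is read, as a simplifying assumption, as a connected component of the network; denser groupings found by community detection would be another reasonable reading. The function names and the edge list are hypothetical.

```python
from collections import defaultdict

def neighbors(edges):
    """Map each term to the set of terms it co-occurs with."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def neighborhoods(adj):
    """Group terms into connected components using depth-first search."""
    seen, groups = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            term = stack.pop()
            if term in group:
                continue
            group.add(term)
            stack.extend(adj[term] - group)
        seen |= group
        groups.append(group)
    return groups

# Hypothetical co-occurring pairs.
edges = [("A", "B"), ("B", "C"), ("X", "Y")]
adj = neighbors(edges)
print(sorted(adj["B"]))                          # ['A', 'C']
print([sorted(g) for g in neighborhoods(adj)])   # [['A', 'B', 'C'], ['X', 'Y']]
```

Here B has two neighbors, and the X-Y pair forms a neighborhood that remains unconnected to the rest.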

Individual terms are, within the context of text mining, symbolically represented as text strings. In the real world, the entity identified by a term normally has several symbolic representations. It is therefore useful to consider terms as being represented by one primary symbol and up to several synonymous alternative symbols. Occurrence of an individual term is established by searching for each known symbolic representation of the term. The process can be augmented through natural language processing (NLP) algorithms that interrogate segments of text for possible alternatives such as word order, spacing and hyphenation. NLP can also be used to identify sentence structure and categorize text strings according to grammar (for example, categorizing a string of text as a noun based on a preceding string of text known to be an article).
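One simple way to search for a primary symbol together with its alternative symbols is a regular expression per term that tolerates differences in spacing and hyphenation. The sketch below is an assumption-laden illustration: the small dictionary, the helper names, and the choice of case-insensitive matching are all made up for the example, and real symbol sets may need case-sensitive or more careful matching.

```python
import re

# Hypothetical dictionary: primary symbol -> known alternative symbols.
DICTIONARY = {
    "TP53": ["p53", "tumor protein p53"],
    "IL6": ["IL-6", "interleukin 6"],
}

def symbol_pattern(symbol):
    """Regex for one symbol that allows an optional space or hyphen between parts."""
    parts = re.split(r"[\s-]+", symbol)
    return r"[\s-]?".join(re.escape(p) for p in parts)

def build_matchers(dictionary):
    """Compile one case-insensitive pattern per primary symbol."""
    matchers = {}
    for primary, alternatives in dictionary.items():
        variants = sorted([primary, *alternatives], key=len, reverse=True)
        body = "|".join(symbol_pattern(v) for v in variants)
        matchers[primary] = re.compile(rf"\b(?:{body})\b", re.IGNORECASE)
    return matchers

def terms_in(text, matchers):
    """Return the primary symbols of all terms found in the text."""
    return {primary for primary, rx in matchers.items() if rx.search(text)}

matchers = build_matchers(DICTIONARY)
found = terms_in("Interleukin-6 signalling was altered in p53 mutant cells.", matchers)
print(sorted(found))  # ['IL6', 'TP53']
```

Mapping every variant back to its primary symbol before counting keeps "IL-6", "IL 6" and "interleukin 6" from appearing as separate nodes in the network.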

Graphic representation of co-occurrence networks allows them to be visualized and inferences to be drawn regarding relationships between entities in the domain represented by the dictionary of terms applied to the text corpus. Meaningful visualization normally requires simplification of the network. For example, networks may be drawn such that the number of neighbors connecting to each term is limited. The criteria for limiting neighbors might be based on the absolute number of co-occurrences or on more subtle criteria such as the “probability” of co-occurrence or the presence of an intervening descriptive term.
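Limiting neighbors by the absolute number of co-occurrences might look like the sketch below. The function `prune`, its parameters `max_neighbors` and `min_count`, and the sample counts are hypothetical; the probability-based and descriptive-term criteria are not implemented. Pair keys are assumed to be alphabetically sorted tuples.

```python
from collections import defaultdict

def prune(pair_counts, max_neighbors=2, min_count=1):
    """Keep, for each term, only its strongest co-occurrence edges."""
    ranked = defaultdict(list)
    for (a, b), n in pair_counts.items():
        if n >= min_count:  # drop edges below the absolute threshold
            ranked[a].append((n, b))
            ranked[b].append((n, a))
    keep = set()
    for term, candidates in ranked.items():
        # Strongest edges first; ties are broken alphabetically.
        strongest = sorted(candidates, key=lambda c: (-c[0], c[1]))[:max_neighbors]
        for _, other in strongest:
            keep.add(tuple(sorted((term, other))))
    return {pair: n for pair, n in pair_counts.items() if pair in keep}

# Hypothetical co-occurrence counts.
counts = {("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 1, ("A", "E"): 1, ("D", "E"): 4}
print(prune(counts, max_neighbors=1))
# {('A', 'B'): 5, ('A', 'C'): 3, ('D', 'E'): 4}
```

An edge survives if either endpoint ranks it among its strongest, so a hub such as A can still display more than `max_neighbors` edges; requiring both endpoints to agree would give a sparser drawing.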

Quantitative aspects of the underlying structure of a co-occurrence network may also be informative, such as the overall number of connections between entities, the clustering of entities representing sub-domains, and the detection of synonyms.[6]

Applications and use

Some working applications of the co-occurrence approach are available to the public through the internet. PubGene is an example of an application that addresses the interests of the biomedical community by presenting networks based on the co-occurrence of genetics-related terms as these appear in MEDLINE records.[7][8] PubGene's CoreMine Medical has been used in studies relating genes and proteins to potentially effective drugs and drug candidates in multiple sclerosis,[9] fibrosis,[10] and hepatitis.[11] CoreMine Medical was also used in a study of genes implicated in post-traumatic stress disorder.[12]

The website NameBase is an example of how human relationships can be inferred by examining networks constructed from the co-occurrence of personal names in newspapers and other texts (as in Ozgur et al.[13]).

Networks of information are also used to facilitate efforts to organize and focus publicly available information for law enforcement and intelligence purposes (so-called "open source intelligence" or OSINT). Related techniques include co-citation networks as well as the analysis of hyperlink and content structure on the internet (such as in the analysis of web sites connected to terrorism[14]).

References

  1. ^ Segev, Elad (2021). Semantic Network Analysis in Social Sciences. London: Routledge. ISBN 9780367636524.
  2. ^ Freilich, Shiri; Kreimer, Anat; Meilijson, Isaac; Gophna, Uri; Sharan, Roded; Ruppin, Eytan (2010-02-27). "The large-scale organization of the bacterial network of ecological co-occurrence interactions". Nucleic Acids Research. 38 (12): 3857–3868. doi:10.1093/nar/gkq118. ISSN 1362-4962. PMC 2896517. PMID 20194113.
  3. ^ Segev, Elad (2021). Semantic Network Analysis in Social Sciences. London: Routledge. ISBN 9780367636524.
  4. ^ Segev, Elad (2020). "Textual network analysis: Detecting prevailing themes and biases in international news and social media". Sociology Compass. 14 (4). doi:10.1111/soc4.12779. S2CID 212890998.
  5. ^ Liu, Yang; Qu, Xiaodong; Elser, James J.; Peng, Wenqi; Zhang, Min; Ren, Ze; Zhang, Haiping; Zhang, Yuhang; Yang, Hua (2019). "Impact of Nutrient and Stoichiometry Gradients on Microbial Assemblages in Erhai Lake and Its Input Streams". Water. 11 (8): 1711. doi:10.3390/w11081711.
  6. ^ Cohen, AM; Hersh, WR; Dubay, C; Spackman, K (2005). "Using co-occurrence network structure to extract synonymous gene and protein names from MEDLINE abstracts". BMC Bioinformatics. 6 (1): 103. doi:10.1186/1471-2105-6-103. ISSN 1471-2105. PMC 1090552. PMID 15847682.
  7. ^ Jenssen, Tor-Kristian; Lægreid, Astrid; Komorowski, Jan; Hovig, Eivind (2001-05-01). "A literature network of human genes for high-throughput analysis of gene expression". Nature Genetics. 28 (1): 21–28. doi:10.1038/ng0501-21. ISSN 1061-4036. PMID 11326270. S2CID 8889284.
  8. ^ Grivell, L. (2002-03-01). "Mining the bibliome: searching for a needle in a haystack?: New computing tools are needed to effectively scan the growing amount of scientific literature for useful information". EMBO Reports. 3 (3): 200–203. doi:10.1093/embo-reports/kvf059. ISSN 1469-221X. PMC 1084023. PMID 11882534.
  9. ^ Dadashkhan, Sadaf; Seyed Amir, Mirmotalebisohi; Poursheykhi, Hossein; Sameni, Marzieh; Ghani, Sepideh; Abbasi, Maryam; Kalantari, Sima; Zali, Hakimeh (2023). "Deciphering crucial genes in multiple sclerosis pathogenesis and drug repurposing: A systems biology approach". J Proteomics. 280 (104890). doi:10.1016/j.jprot.2023.104890. PMID 36966969.
  10. ^ Wilson, Ava C; Chiles, Joe; Ashish, Shah; Chanda, Diptiman; Kumar, Preeti L; Mobley, James A; Neptune, Enid R; Thannickal, Victor J; McDonald, Merry-Lynn N (2022). "Integrated bioinformatics analysis identifies established and novel TGFβ1-regulated genes modulated by anti-fibrotic drugs". Sci Rep. 12 (1): 3080. Bibcode:2022NatSR..12.3080W. doi:10.1038/s41598-022-07151-1. PMC 8866468. PMID 35197532.
  11. ^ Li, Shenghao; Hao, Liyuan; Hu, Xiaoyu; Li, Luya (2023). "A systematic study on the treatment of hepatitis B-related hepatocellular carcinoma with drugs based on bioinformatics and key target reverse network pharmacology and experimental verification". Infect Agent Cancer. 18 (1): 41. doi:10.1186/s13027-023-00520-z. PMC 10315056. PMID 37393234.
  12. ^ Bian, Yao-Yao; Yang, Li-Li; Zhang, Bin; Li, Wen; Li, Zheng-Jun; Li, Wen-Lin; Zeng, Li (2020). "Identification of key genes involved in post-traumatic stress disorder: Evidence from bioinformatics analysis". World J Psychiatry. 10 (12): 286–298. doi:10.5498/wjp.v10.i12.286. PMC 7754529. PMID 33392005.
  13. ^ Ozgur, A; Cetin, B; Bingol, H (2007-12-15). "Co-occurrence Network of Reuters News". https://arxiv.org/abs/0712.2491
  14. ^ Yilu Zhou; Reid, E.; Jialun Qin; Hsinchun Chen; Guanpi Lai (2018-05-22). "US domestic extremist groups on the Web: link and content analysis". IEEE Intelligent Systems. 20 (5): 44–51. doi:10.1109/MIS.2005.96. S2CID 15687907.