Subtitle Frequency
We present SUBTLEX-PL, Polish word frequencies based on movie subtitles. In two lexical decision experiments, we compare the new measures with frequency estimates derived from another Polish text corpus that includes predominantly written materials. We show that the frequencies derived from the two corpora perform best in predicting human performance in a lexical decision task if used in a complementary way. Our results suggest that the two corpora may have unequal potential for explaining human performance for words in different frequency ranges and that corpora based on written materials severely overestimate frequencies for formal words. We discuss some of the implications of these findings for future studies comparing different frequency estimates. In addition to frequencies for word forms, SUBTLEX-PL includes measures of contextual diversity, part-of-speech-specific word frequencies, frequencies of associated lemmas, and word bigrams, providing researchers with necessary tools for conducting psycholinguistic research in Polish. The database is freely available for research purposes and may be downloaded from the authors' university Web site at -pl .
Word frequency is an important variable in cognitive processing. High-frequency words are perceived and produced faster and more efficiently than low-frequency words. At the same time, they are easier to recall but more difficult to recognize in episodic memory tasks.
To investigate the word frequency effect or to match stimuli on word frequency, psychologists need estimates of how often words occur in a language. In American English, the Kučera and Francis (KF) frequencies have become the norm. This is surprising, because the KF frequencies are dated (from 1967) and based on a corpus of only 1.014 million words. Several studies have confirmed the poor quality of the Kučera and Francis word frequencies (Burgess & Livesay, 1998; Zevin & Seidenberg, 2002; Balota et al., 2004).
Another regularly used word frequency measure is based on the CELEX database (Baayen, Piepenbrock, & van Rijn, 1993). This measure is better than Kučera and Francis, but not optimal either (Balota et al., 2004; Zevin & Seidenberg, 2002).
To assess the quality of a frequency measure, one needs word processing times. These have become available as part of the English Lexicon Project. Brysbaert and New (Behavior Research Methods, in press) calculated the percentages of variance accounted for by the Kučera and Francis and CELEX frequencies in the accuracies and reaction times of a lexical decision task.
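As a concrete illustration of this kind of validation, the sketch below regresses lexical decision RTs on log-transformed word frequency and reports the proportion of variance accounted for. The file name and column names are hypothetical placeholders, not part of any published pipeline.

```python
# Minimal sketch of a frequency-validation analysis: regress lexical decision
# RTs on log word frequency and report the variance accounted for.
# "items.csv" and its column names are hypothetical placeholders.
import numpy as np
import pandas as pd

items = pd.read_csv("items.csv")                        # columns: word, rt, freq_per_million
x = np.log10(items["freq_per_million"].to_numpy() + 1)  # +1 guards against log10(0)
y = items["rt"].to_numpy(dtype=float)

X = np.column_stack([np.ones_like(x), x])               # intercept + log frequency
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

n, p = len(y), 1
r2 = 1 - resid.var() / y.var()
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)           # adjusted R², as reported in the tables below
print(f"R² = {r2:.3f}, adjusted R² = {adj_r2:.3f}")
```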
In Van Heuven, Mandera, Keuleers, and Brysbaert (QJEP, 2014) we proposed a new frequency measure, the Zipf scale, which is much easier to understand than the usual frequency measures. Zipf values range from 1 to 7, with values 1-3 indicating low-frequency words (frequencies of 1 per million words and lower) and values 4-7 indicating high-frequency words (frequencies of 10 per million words and higher). A zipped Excel file of SUBTLEX-US with the Zipf values included is available for download.
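A back-of-the-envelope implementation makes the scale concrete: a Zipf value is the base-10 log of a word's frequency per billion words, i.e., log10(frequency per million) + 3. The sketch below uses this basic definition only; the published SUBTLEX norms additionally apply Laplace smoothing for rare and unseen words, which is omitted here.

```python
import math

def zipf_value(count: int, corpus_size_in_tokens: int) -> float:
    """Zipf value = log10(frequency per billion words)
                  = log10(frequency per million words) + 3.
    Note: the published norms add Laplace smoothing on top of this
    basic definition; that refinement is omitted in this sketch."""
    per_million = count / (corpus_size_in_tokens / 1_000_000)
    return math.log10(per_million) + 3

# A word occurring once per million words gets Zipf 3 (upper end of "low"),
# and one occurring 10 times per million gets Zipf 4 (lower end of "high").
print(zipf_value(100, 100_000_000))    # 1 per million  -> 3.0
print(zipf_value(1_000, 100_000_000))  # 10 per million -> 4.0
```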
We examined the potential advantage of lexical databases based on subtitles and present SUBTLEX-PT, a new lexical database of 132,710 Portuguese words obtained from a 78 million word corpus of film and television series subtitles, offering word frequency and contextual diversity measures. Additionally, we validated SUBTLEX-PT with a lexical decision study involving 1,920 Portuguese words (and 1,920 nonwords) of different lengths in letters (M = 6.89, SD = 2.10) and syllables (M = 2.99, SD = 0.94). Multiple regression analyses on latency and accuracy data were conducted to compare the proportion of variance explained by the Portuguese subtitle word frequency measures with that accounted for by the recent written-word frequency database (Procura-PALavras, P-PAL; Soares, Iriarte, et al., 2014). Like its international counterparts, SUBTLEX-PT explains approximately 15% more of the variance in the lexical decision performance of young adults than the P-PAL database. Moreover, in line with recent studies, contextual diversity accounted for approximately 2% more of the variance in participants' reading performance than the raw frequency counts obtained from subtitles. SUBTLEX-PT is freely available for research purposes (at -pal.di.uminho.pt/about/databases).
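The two measures driving these analyses are easy to picture: word frequency is the total occurrence count across the whole subtitle corpus, while contextual diversity is the number of films or episodes in which a word appears at least once. The sketch below shows one way to derive both from a folder of plain-text subtitle files; the folder layout and tokenization are simplified assumptions, not the procedure used to build SUBTLEX-PT.

```python
# Frequency = total occurrences across all files; contextual diversity =
# number of files (films/episodes) in which a word occurs at least once.
# The "subtitles" folder and .txt layout are hypothetical assumptions.
import re
from collections import Counter
from pathlib import Path

frequency = Counter()
contextual_diversity = Counter()

for subtitle_file in Path("subtitles").glob("*.txt"):
    text = subtitle_file.read_text(encoding="utf-8").lower()
    tokens = re.findall(r"[^\W\d_]+", text)     # runs of letters, Unicode-aware
    frequency.update(tokens)
    contextual_diversity.update(set(tokens))    # each file counts at most once per word

corpus_size = sum(frequency.values())
for word, count in frequency.most_common(10):
    per_million = count / (corpus_size / 1_000_000)
    print(word, count, contextual_diversity[word], round(per_million, 2))
```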
Table 1. Mean word frequency (per million words), number of orthographic neighbors (N), word length (number of characters), and subjective concreteness ratings of the words used in Experiment 1 (standard deviations in parentheses).
Table 3. Percentages of reaction time (RT) and error rate (E%) variance (adjusted R²) explained in Experiment 1 by the different frequency measures, with and without word length (L) and number of syllables (NSyl) as additional variables.
Table 5. Percentages of reaction time (RT) and error rate (E%) variance (adjusted R²) explained in Experiment 2 by the different frequency measures, with and without word length (L) and number of syllables (NSyl) as additional variables.
As expected, these norms explain 3% more variance in the lexical decision times of the British Lexicon Project than the SUBTLEX-US word frequencies. They also explain 4% more variance than the word frequencies based on the British National Corpus, further confirming the superiority of subtitle-based word frequencies over written-text-based word frequencies for psycholinguistic research. In contrast, the word frequency norms explain 2% less variance in the English Lexicon Project than the SUBTLEX-US norms.
The SUBTLEX-UK word frequencies are based on a corpus of 201.3 million words from 45,099 BBC broadcasts. There are separate measures for pre-school children (the CBeebies channel) and primary school children (the CBBC channel). For the first time, we also present the word frequencies as Zipf values, which are very easy to understand (values 1-3 = low-frequency words; 4-7 = high-frequency words) and which we hope will become the new standard.
Van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. Quarterly Journal of Experimental Psychology, 67, 1176-1190.
We examine the use of film subtitles as an approximation of word frequencies in human interactions. Because subtitle files are widely available on the Internet, they may present a fast and easy way to obtain word frequency measures in language registers other than written text. We compiled a corpus of 52 million French words coming from a variety of films. Frequency measures based on this corpus compared well with other spoken and written frequency measures, and explained variance in lexical decision times over and above what is accounted for by the available French written frequency measures.
Lexical frequency is one of the strongest predictors of word processing time. Frequencies are most often calculated from book-based corpora or, more recently, from subtitle-based corpora. We present new frequencies based on Twitter, blog posts, and newspapers for 66 languages. We show that these frequencies predict lexical decision reaction times as well as, and in some cases better than, the existing frequency measures. These new frequencies are freely available for download.
The number of occurrences of a word within a corpus is one of the best predictors of word processing time (Howes & Solomon, 1951). High-frequency words are processed more accurately and more rapidly than low-frequency words, both in comprehension and in production (Baayen, Feldman, & Schreuder, 2006; Balota, Cortese, Sergent-Marshall, Spieler, & Yap, 2004; Monsell, 1991; Yap & Balota, 2009). The word frequency effect has been observed in a variety of tasks, such as lexical decision (Andrews & Heathcote, 2001; Balota et al., 2004), perceptual identification (Grainger & Jacobs, 1996; Howes & Solomon, 1951), pronunciation (Balota & Chumbley, 1985; Forster & Chambers, 1973), and semantic categorization (Andrews & Heathcote, 2001; Taft & van Graan, 1998). It is also a robust effect, having been found in many languages.
More recently, another source of corpora was found to be reliable: movie subtitles. Subtitle-based frequencies were first computed for French by New, Brysbaert, Véronis, and Pallier (2007). The authors showed two main results. First, the subtitle-based frequencies were a better predictor of reaction times than the book-based frequencies. Second, the subtitle-based frequencies were complementary to the book-based frequencies: for instance, words typical of everyday spoken language were much more frequent in the subtitle-based corpus than in the book-based corpora. Because the two kinds of frequencies explained more variance together than separately, the authors concluded that book-based frequencies can serve as good estimates of written language and subtitle-based frequencies as good estimates of spoken language. Subtitle-based frequencies were subsequently compiled for other languages, in which these results have been replicated: English (Brysbaert & New, 2009), Dutch (Keuleers, Brysbaert, & New, 2010), Chinese (Cai & Brysbaert, 2010), Greek (Dimitropoulou, Duñabeitia, Avilés, Corral, & Carreiras, 2010), Spanish (Cuetos, Glez-Nosti, Barbón, & Brysbaert, 2011), German (Brysbaert, Buchmeier, Conrad, Jacobs, Bölte, & Böhl, 2011), British English (van Heuven, Mandera, Keuleers, & Brysbaert, 2014), and Polish (Mandera, Keuleers, Wodniecka, & Brysbaert, 2015).
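The complementarity claim is a statement about regression models: a model with both log frequencies as predictors accounts for more RT variance than either predictor alone. Below is a minimal sketch of that comparison, assuming three aligned NumPy arrays (hypothetical names) for the same word list.

```python
# Sketch of a complementarity check: compare the variance explained by
# book-based and subtitle-based log frequencies, alone and combined.
import numpy as np

def variance_explained(predictors, rts):
    """R² of an ordinary least-squares fit of rts on the given predictor(s)."""
    X = np.column_stack([np.ones(len(rts)), predictors])
    beta, *_ = np.linalg.lstsq(X, rts, rcond=None)
    resid = rts - X @ beta
    return 1 - (resid ** 2).sum() / ((rts - rts.mean()) ** 2).sum()

# log_book, log_subtitle, rts: hypothetical aligned arrays for one word list.
# Complementarity shows up as the joint model beating both single-predictor models:
#   variance_explained(log_book, rts)
#   variance_explained(log_subtitle, rts)
#   variance_explained(np.column_stack([log_book, log_subtitle]), rts)
```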
Another source that has yielded good frequency measures is the Internet. The Internet presents two advantages. First, it is easier to build a large corpus from the Internet than from books, since there is no need to scan documents. Second, the language used on the Internet is more varied than the language in books. Lund and Burgess (1996) proposed a corpus (named HAL) based on approximately 160 million words taken from Usenet newsgroups. Burgess and Livesay (1998) found that the word frequencies from HAL were a better predictor of lexical decision times than the Kučera and Francis (1967) frequencies. Balota, Cortese, Sergent-Marshall, Spieler, and Yap (2004) reached the same conclusion and recommended the HAL frequencies (Balota et al., 2007). According to Balota et al. (2004), the poor performance of the Kučera and Francis frequencies was largely due to the small size of the corpus. To investigate the importance of corpus size, Brysbaert and New (2009) selected sections of the British National Corpus (Leech, Rayson, & Wilson, 2001) of different sizes (from 0.5 million to 88 million words). They then correlated the word frequencies in the different sections with the reaction times in the English Lexicon Project (Balota et al., 2007). The percentage of variance accounted for in the lexical decision times reached its peak at a section size of 16 million words, especially for low-frequency words. The conclusion was that a corpus of 16 million words seems to be sufficient.
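The logic of that corpus-size analysis can be sketched as follows: take progressively larger contiguous slices of a tokenized corpus, recompute log frequencies from each slice, and check how strongly each slice's estimates correlate with lexical decision times. All inputs here are hypothetical stand-ins for the BNC sections and the English Lexicon Project data.

```python
# Sketch of a corpus-size analysis: how does the variance accounted for in RTs
# change as frequency estimates come from larger and larger corpus slices?
import numpy as np
from collections import Counter

def variance_by_corpus_size(tokens, rt_by_word, sizes):
    """tokens: full corpus as a token list; rt_by_word: word -> mean RT;
    sizes: slice sizes in tokens (e.g. 500_000 up to 88_000_000)."""
    words = list(rt_by_word)
    rts = np.array([rt_by_word[w] for w in words])
    results = {}
    for size in sizes:
        counts = Counter(tokens[:size])              # one contiguous slice per size
        logf = np.log10([counts[w] + 1 for w in words])  # +1 for words absent from the slice
        r = np.corrcoef(logf, rts)[0, 1]
        results[size] = r * r                        # variance accounted for
    return results
```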