Let’s say you’re an English speaker, reading a text, and you run into a new word. Should you ever have to pronounce that word out loud, short of looking it up in a dictionary, you’ll draw on your knowledge of English – especially the correspondence between certain (combinations of) letters and the sounds they make. So far, so good, but sometimes there’s conflicting evidence in the lexicon. And while we like to think we all speak The Same English, in reality “your own personal English” is ever-so-different from someone else’s because your experiences with language are unique. Fortunately, this doesn’t usually keep us from understanding one another. For instance, when I moved to Canada and heard the first syllable of “pasta” being pronounced “pass” (and not with the vowel of “paw,” like I was used to), it was certainly new and notable for me, but it didn’t lead to any confusion. That difference you can easily pin down to regionality – you hear a certain pronunciation, and you can guess a thing or two about the person.
The “gif debate” strikes me as similar, but with a crucial distinction: I can’t think of any easy “group signalling” (intentional or otherwise) that either pronunciation brings with it, beyond simply the pronunciation itself. Saying it [dʒɪf] signals you’re a [dʒɪf]-er, pure and simple. You can’t tell where someone’s from, what socioeconomic class they might fit into, their sex or gender, etc., just from that initial consonant. If we could, I doubt we’d be talking about which form is “right” or “wrong”. Instead, we’d probably (imperfect beings that we are) fall into some pretty harmful traps if that distinction fell along some power differential. (Just look at “vocal fry.” Men use it more than women, but because it’s associated with women’s speech, it can be seen as undesirable, unprofessional, and so on. As some have insightfully remarked, “Vocal fry is the new ‘shrill’.”)
With this post, I want to probe one aspect of why that initial, subconscious decision is made – when you first saw “gif”, why you heard it [gɪf] or [dʒɪf]. And to do that, we need to look at patterns and regularities in the English lexicon. I’m going to disappoint you right away: both forms can be justified, and because English speakers understand both of them, implicitly, both forms are “right.” The debate may be fun, but I think we all know deep down that nothing will ever definitively prove one over the other.
So what’s the evidence for each? Let’s start with the boring stuff…
The data come from the English Lexicon Project and consist of a text-based file of nearly 80,000 orthographic forms, their pronunciation and other information such as the number of syllables and frequency. “Frequency” should be read as the number of occurrences of a word in a given (very large) corpus. We’ll be using a log-transformed frequency, so keep in mind that what might look like small differences can in fact be very big.
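For concreteness, here is a minimal sketch of that starting point; the file name and column names are assumptions about the ELP export, not the actual code behind this post:

```r
# Minimal sketch; the file and column names are assumptions about the ELP export.
elp <- read.delim("elp_items.txt", stringsAsFactors = FALSE)

# Log-transform the raw corpus frequency (the ELP also provides a pre-computed
# log frequency column; either works for the comparisons below).
elp$log_freq <- log(elp$Freq_HAL + 1)
```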
The main processing starts by subsetting all words in English containing the sequence of letters “gi”. We then lemmatize them to weed out related forms (like “talk” and “talking”, etc.) and reduce them to a single, simple form. (Another, more complicated, manual process helps do this further. It removes a lot of adverbs and reduces pairs like “gilder” and “gilded”.) We then use Regular Expressions (think of these like a fancy “search and replace” that can generalize instead of performing only exact searches) to extract, from the phonetic transcription, the consonant and the vowel corresponding to these “gi” letter sequences. (At this stage, there’s no more room for ambiguity; if it returns “g” for the consonant, it means the [g] of “good”.)
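A rough sketch of those steps might look like this; the column names, the lemmatizer and the regular expression are stand-ins, not necessarily what was actually used:

```r
# Subset to words whose spelling contains "gi".
gi <- elp[grepl("gi", elp$Word, ignore.case = TRUE), ]

# Collapse related forms ("talk"/"talking") to one lemma each; textstem is one option.
library(textstem)
gi$lemma <- lemmatize_words(tolower(gi$Word))
gi <- gi[!duplicated(gi$lemma), ]

# Pull the consonant corresponding to the written "gi" out of the transcription,
# assuming ELP-style ASCII symbols where [g] is "g" and "dZ" stands for the [dʒ] of "giant".
gi$cons <- sub(".*?(dZ|g)[I@i].*", "\\1", gi$Pron, perl = TRUE)
```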
We still need to get rid of some unfair forms that are overrepresented in the data. That’s not to say these forms offer no insight into our instincts – in fact, I think we should keep them in mind. However, they tend to obscure some more important trends. For instance, the present participles (“-ing” forms) of verbs ending in “ge” (like “arrange”) make up a lot of [dʒ] forms. Is this important to know? Yes! It means we encounter this mapping of “gi” to [dʒɪ] a lot inside words. Does it give us a deeper understanding of the number of unique forms in English corresponding to one consonant over the other? Not really. The same goes for all the “-ology” and “-ologist” forms, too.
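In code, that filtering can be as simple as a couple of pattern drops; the patterns here are guesses at the families described above, not the exact criteria:

```r
# Drop the overrepresented families: "-ing" forms of verbs ending in "ge"
# ("arranging") and the "-ology"/"-ologist" family. A sketch, not the exact filters.
gi <- gi[!grepl("ging$", gi$Word, ignore.case = TRUE), ]
gi <- gi[!grepl("olog(y|ies|ist|ists)$", gi$Word, ignore.case = TRUE), ]
```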
What does that leave us with? 269 forms. Not a lot, but keep in mind that these are unique, simple forms. Now, let’s talk about competing evidence, starting with…
This is a term I’m pulling out of thin air. I think. But it means this: [g] has an advantage over [dʒ] at the beginning of words, both in number of forms and in their frequency. That is, there are more word-initial “gi” forms whose consonant corresponds to [g] (21) than to [dʒ] (15). Those [g]-initial words are also much more frequent (average log frequency of 6.03 vs. 5.17 for [dʒ]), thanks in large part to “give” and “girl”. (Want the raw frequency numbers? It’s 14198.57 vs. 2874.4! Thanks again to “give”.) “Giant” and “ginger” are the most common words in the [dʒ] camp. While “giant” comes close to “girl” in terms of frequency, neither comes close to “give,” and “ginger” trails far behind. I think it’s also worth noting that the vowel of “give” reinforces [gɪ]. “Giant” doesn’t (but then again, neither does “girl”.)
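The word-initial comparison boils down to a count and a mean log frequency by consonant, something like this (column names as in the earlier sketches):

```r
# Word-initial "gi" forms: how many map to each consonant, and how frequent are they?
init <- gi[grepl("^gi", gi$Word, ignore.case = TRUE), ]
table(init$cons)                        # counts by consonant
tapply(init$log_freq, init$cons, mean)  # mean log frequency by consonant
```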
One additional bit of evidence for [g] comes from looking backwards, from the pronunciations [gɪ] and [dʒɪ] towards letters (so, not just at words containing the written letters “gi”). At the beginning of words, [gɪ] is spelled “gi” more often than not (the alternatives being spellings like “gui” or “gea”), while [dʒɪ] is more often spelled “ji” than “gi”. (Note that while I didn’t process this part of the data for related forms, there are few enough of them that you can eyeball them.)
This might sting for you [gɪf]-ers. Outside of word-initial position, everything’s flipped. [dʒ] forms are more numerous (180, vs. 43 for [g]) and common (average log frequency: 6.06, vs. 5.2 for [g]). Even if you categorize words into “bins” of frequency (small, medium, large), the trend is the same: [dʒ] has more forms, by far. (The effect is most pronounced in words with middling frequency.) You can see this in the graph below. Word-initial “gi” forms are included as a point of comparison.
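The word-internal comparison, including the frequency bins, can be computed the same way (again a sketch, with the same assumed column names):

```r
# Word-internal "gi" forms: counts, mean log frequency, and rough frequency bins.
intern <- gi[!grepl("^gi", gi$Word, ignore.case = TRUE), ]
table(intern$cons)
tapply(intern$log_freq, intern$cons, mean)

# Three frequency bins (small / medium / large), cross-tabulated by consonant.
intern$bin <- cut(intern$log_freq, breaks = 3, labels = c("small", "medium", "large"))
table(intern$bin, intern$cons)
```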
While it’s harder to gauge the correspondence working backwards, the fact that the letter “j” is more frequent at the beginning of words than inside words should be an indication that internal [dʒɪ] maps fairly frequently to “gi”. (Of the 230 unprocessed forms containing internal [dʒɪ], only 9 had the letter “j” anywhere in their written forms. Only 23 were missing the “gi” letter sequence.)
Finally, remember those forms I cut out? All the “arranging”, the “-ologies” and so on? Their regularity and productivity (you can turn anything into an “-ology”) keep piling on extra “gi” ↔ [dʒɪ] correspondence inside words.
Let’s step back for a moment and ask: what’s “right” and “wrong” when it comes to something like this, the pronunciation of a word? Is it what an authority said (for those of you who invoke Steve Wilhite)? Tempting as that might be, I’d argue (and I think most, if not all, linguists would agree with me) that “right” is simply an intelligible form implicitly understood by a speech community. That is, it’s right because they agree it’s right. There may be “right within our community” vs. “right in some other community that I can gauge” (like dialectal differences), but there’s also “wrong” in the sense of “I don’t know anyone who says it that way.” Dare I say, for instance, that we native English speakers are all in agreement that the “i” in “gif” doesn’t make the vowel [i] as in “beach.” The evidence in English is just that strong and consistent: the letter “i” makes the sound [ɪ] when it’s in a syllable that ends with a consonant (or at least a consonant like [f]). Change the spelling to “geaf,” and our intuitions change.
Bottom line is, hard as it is to accept, language actually is a democracy. That’s Linguistics 101, when we talk about the arbitrariness of capital-L Language. And between the [dʒɪf]-ers and [gɪf]-ers, really all we differ in is the generalizations we’ve made from our vocabularies in order to decide how the word “sounds” to us. Some of us pay attention to word position, others to trends in the lexicon as a whole. The rest is just after-the-fact attempts to justify ourselves and have some fun along the way.
…all that said, it’s [dʒɪf]. #sorrynotsorry
I won’t rehash the history of the Pinker letter. I assume if you’re reading this, you know what it is, and you know the controversy surrounding it.
The point of this post is pure and simple. On several occasions, in apparent efforts to discredit “the Letter,” Steven Pinker has alluded to the academic status of its signatories. Most recently, on the BBC radio program “World at One” with Sarah Montague, he claimed, “Most of them were graduate students and lecturers… by no means an indication of the sentiment of professional linguists.”
I won’t go into the many ways in which I find that statement problematic. I defer to my colleagues for that one (here’s a recent example from Dr. Caitlin Green).
What I want to point out here is that this statement is, to give the benefit of the doubt, an oversimplification. At worst, misleading or disingenuous. As other linguists have pointed out, 7 LSA Fellows figure among the signatories of the Letter, and there are definitely professors to be found on the list.
So let’s quantify who signed the Letter, to put this to rest once and for all. With reproducible R code (see the very end of the post)! This is a very simple task. The data were imported from a local CSV file as sign.
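That import step is just a one-liner; the file name here is hypothetical:

```r
# Read the list of signatories from a local CSV (hypothetical file name).
sign <- read.csv("signatories.csv", stringsAsFactors = FALSE)
```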
I then generated a list of all unique text in the “Role” column and manually classified them as either professor, student, lecturer, researcher or other. I hope I didn’t step on any toes with this one. I looked up a couple of ambiguous titles, but for most I went with what was written. Again, you can see these categories in the code. Some particularities:
Note that this exercise felt pretty weird to me. Academic rank is a strange thing, and this whole classification scheme felt a little unsavory and demeaning. Anyway. Once those roles are defined, we make a lookup dataframe and match people with their generalized roles. (I reordered the factor to make the bar graph pretty.)
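A sketch of that lookup-and-match step, with illustrative category labels:

```r
# Build a lookup of unique titles and their hand-assigned general roles (sketch).
roles <- data.frame(Role = unique(sign$Role), stringsAsFactors = FALSE)
roles$general <- NA_character_   # filled in by hand: "professor", "student", "lecturer", ...

# Match each signatory to a generalized role, then reorder the factor for the plot.
sign$general <- roles$general[match(sign$Role, roles$Role)]
sign$general <- factor(sign$general,
                       levels = c("professor", "researcher", "lecturer", "student", "other"))
```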
Let’s take a look at the values:
124 names lacked a title. These were excluded from the graph. If anyone wants to look them up, have fun.
Here are those results in a bar graph.
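Something along these lines produces the bar graph (ggplot2 assumed; names without a title excluded, as noted above):

```r
# Bar graph of signatories by generalized role, excluding names without a title.
library(ggplot2)
ggplot(subset(sign, !is.na(general)), aes(x = general)) +
  geom_bar() +
  labs(x = NULL, y = "Number of signatories")
```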
Take from this what you will. Don’t @ me.
I want to be crystal clear about one thing: even if the Letter was written exclusively by students, that shouldn’t discredit their claims a priori. A lot of Pinker defenders criticize the Letter and its proponents as levelling ad hominem attacks against Pinker – what is this, if not that? Pinker is using the identities of the Letter signatories as a tactic to invalidate its claims. Those claims should be addressed, not dismissed by nitpicking or attacking an often arbitrary status. (Have you seen the academic job market??)
I’m going to stop there before I go off on a tangent. Again, this post is just to quantify who signed the letter, because some people think that matters.
I even hesitated to write this post for fear of giving credence to this kind of argument. But in the end, I wanted to show that, playing by these (harmful!) rules, dismissing the Letter based on the status of the people who signed it is not as easy as, ahem, certain people have made it out to be.
Let’s look at the script first (comments will follow), assuming a simple sample transcription "I don’t watch football" with some room for variation. Namely, let’s say we’ll accept:
Note that even with this small text, with all the variants accepted, we have 144 model transcriptions! The variants can, of course, be trimmed down with more specific instructions. (If absolutely necessary, you can randomly pick an n-sized sample of your model transcriptions.) In the feedback file, you may wish to limit the model transcriptions to one or only a few (there can be many that match equally well). You can print only the first one, for instance, by replacing a[i,]$var with gsub("^(.+?) / .+", "\\1", a[i,]$var). Or you could replace the definition of corr above with paste(head(corrige.a[adist(corrige.a, text)==min(adist(corrige.a, text))],3), collapse=" / ") to give you, say, the first 3 matching models.
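For context, here is a minimal sketch of what that corr definition is doing – the model transcriptions and student line below are illustrative, not the assignment’s actual content:

```r
# Sketch: find the model transcription(s) closest to a student's line using
# simple edit distance (adist). The objects below are illustrative stand-ins
# for corrige.a (the model transcriptions) and text (the student's line).
corrige.a <- c("aɪ doʊnt wɑtʃ fʊtbɑl", "a doʊnt wɑtʃ fʊtbɑl")
text      <- "aɪ dont wɑtʃ fʊtbɑl"

d    <- adist(corrige.a, text)                           # edit distance to each model
corr <- paste(corrige.a[d == min(d)], collapse = " / ")  # all tied-for-closest models
```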
This script assumes a typical Moodle assignment structure, i.e., a folder containing one sub-folder per student (named after the student), each of which holds that student’s submission. Note that if students entered the text in an online form, rather than submitting a .txt file, you’ll have to convert their submissions from .html to .txt.
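As a rough illustration of that structure (the folder name and file handling here are hypothetical), reading the first line of each student’s submission might look like this:

```r
# Sketch: walk the Moodle download (one sub-folder per student) and read the
# first line of each student's .txt submission. The folder name is hypothetical.
student_dirs <- list.dirs("assignment_export", recursive = FALSE)
first_lines <- sapply(student_dirs, function(d) {
  f <- list.files(d, pattern = "\\.txt$", full.names = TRUE)[1]
  readLines(f, warn = FALSE, encoding = "UTF-8")[1]
})
```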
Setting up the exercise takes some care. In particular, students have to:
Note that this sample script works on only the first line, i.e., the first transcription. You can, of course, define all your model transcriptions and expand the for loop (or make a separate, dedicated version for each transcription) to cover each line in the student file.
Failing to respect any one of these instructions can lead to a very low score. Some clean-up is done in the script, but it’s impossible to predict every type of error, so papers with abnormally low scores should be dealt with individually. To give just one example, a handful of papers for one assignment had trailing or stray combining characters (like vowel nasalization diacritics) that are invisible to the naked eye; if memory serves, this can arise if the diacritic is added twice or more, or if, at the beginning of a line, the diacritic is added before the character and then added again after it.
The current script takes a somewhat “brute force” approach to errors. That is, it doesn’t distinguish the severity or type of errors; it only gauges the similarity of the student text to the closest model text. I don’t see this as a major problem for introductory levels, especially given the precautions detailed above. However, for intermediate and more advanced transcriptions, I plan on implementing a basic notion of featural similarity that can be context-dependent. For example, errors in place of articulation can be less serious if the major classes are correct, and could be even less serious in potential contexts for place assimilation. But that’s a project for this summer!
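To make that concrete, here is one way such a distance-based score might be computed – a sketch of the general idea, not the script’s actual formula:

```r
# Sketch: score a student transcription by its edit distance to the closest
# model transcription, scaled by that model's length. Not the actual formula.
score_transcription <- function(text, models) {
  d    <- adist(models, text)
  best <- which.min(d)
  max(0, round(100 * (1 - d[best] / nchar(models[best]))))  # 100 = identical to a model
}

score_transcription("aɪ dont wɑtʃ fʊtbɑl",
                    c("aɪ doʊnt wɑtʃ fʊtbɑl", "a doʊnt wɑtʃ fʊtbɑl"))
```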
…jsonlite package handles .jsonl files.
The following code loops through all the .jsonl files in a folder, extracts only the fields you request, and provides you with uniform .csv files which can later be re-imported into R without any of the problems you may have run into working with Twitter data in this format. Some commentary on the code follows.
In this database at least, each line of the .jsonl (that is, each tweet) has a variable amount of information, from 200 to – on rare occasions – upwards of 600 fields. In my study, I only need about 30, some of which (like geographic information) may additionally be entirely absent from a given tweet. As of writing, the fromJSON() call from the jsonlite package automatically takes care of this when the flatten = TRUE option is specified, but only with standard .json files. Trying to run this command on an entire .jsonl file will run into an error (“trailing garbage”) at the first line break. Note also that this command now returns a data frame from .json files but lists from individual .jsonl lines.
We could work around this by brute-forcing a conversion from each .jsonl to .json, for example, by replacing all line breaks (save for the last) with a comma and by placing the entire string between [ ], but this can be a time-consuming process and proves ultimately unnecessary.
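For what it’s worth, that brute-force conversion might look something like this (the file name is hypothetical):

```r
# Sketch of the brute-force .jsonl-to-.json workaround described above.
library(jsonlite)
txt  <- readLines("tweets.jsonl", warn = FALSE)       # hypothetical file
txt  <- txt[nzchar(txt)]                              # drop any empty trailing line
json <- paste0("[", paste(txt, collapse = ","), "]")  # one big JSON array
df   <- fromJSON(json, flatten = TRUE)
```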
As some solutions on StackOverflow suggest, we could import and unlist each tweet individually in order to build a database iteratively in a for loop. As this will provide us with named character vectors of differing lengths, we can simply name in advance those elements that we want to retain (if present) or append (if absent) by creating a vector of field names (called cols here) and subsetting the output of unlist(). Since the names of those fields absent from a given tweet will be missing (in addition to their values, obviously), we have to replace the names of the vector with cols. Running this on every line of each .jsonl file using lapply(), we get uniform vectors that we can ultimately bind together with do.call(). The result is a single, large matrix which we’ll be converting to a data frame and saving as a .csv file. (Because a lot can go wrong between exporting and (re-)importing .csv files of Twitter data, this script uses the save_as_csv function from the rtweet package.)
We then loop this over all the files in a given folder, and a message tells us when each file is done. Finally, file.remove() can be called to permanently delete the .jsonl files if disk space is an issue.
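Putting those pieces together, a minimal sketch of the whole loop might look like the following – the folder name and the fields in cols are illustrative, not the exact ones used in the original script:

```r
# Minimal sketch of the approach described above (not the original script).
# The folder name and the fields in `cols` are illustrative.
library(jsonlite)
library(rtweet)

cols  <- c("id_str", "created_at", "text", "user.screen_name", "place.full_name")
files <- list.files("tweets_jsonl", pattern = "\\.jsonl$", full.names = TRUE)

for (f in files) {
  lines <- readLines(f, warn = FALSE)
  rows <- lapply(lines, function(line) {
    tweet <- unlist(fromJSON(line, flatten = TRUE))  # named character vector, length varies
    out <- tweet[cols]     # present fields keep their values; absent ones come back as NA
    names(out) <- cols     # restore the names so every vector lines up
    out
  })
  df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  save_as_csv(df, sub("\\.jsonl$", ".csv", f))       # rtweet's csv writer
  message("Done: ", basename(f))
  # file.remove(f)  # uncomment to delete the .jsonl once converted
}
```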
A good take-away from this code is the relative efficiency of the apply() family, in comparison with a for loop. Benchmarking this code versus a nested for loop showed the lapply() approach to be up to 10 times faster, and with large files like these, that can make the difference between 2 and 20 minutes per file!
For resources on why for loops are slow, try these links, and for a great resource on converting for loops to apply() functions, look here.