Selection of Texts
The material recorded by the BDLT team totals some 250 hours. From this material, segments were selected to be transcribed as texts and included in the database. The selections, made primarily by Zhobov, followed two criteria. First, we wanted each text to illustrate as many of the salient features of the local dialect as possible; second, we required that each text constitute a well-formed instance of discourse and, ideally, give some insight into village life. Several texts are narratives, either of a folktale or of a personal experience. Most, however, consist of conversations about the way things were done in “the old days”. As is common in ethnographic work, turning the conversation in this direction is an effective way not only to garner valuable ethnographic information (about holiday customs, agricultural practices, food preparation and the like), but also to direct the speaker’s attention away from the present (and thus to minimize the influence of the standard language).
Most texts concentrate on the speech of a single informant. In some instances, the voices of other speakers can also be heard. Usually these are in the background, although in some cases the text consists of two (and, rarely, three) speakers in dialogue with one another. In certain instances these segments were chosen deliberately because the topic and the liveliness of the conversation provided especially vivid examples of natural dialectal speech.
In keeping with the project goal of presenting natural speech, there has been minimal editing of the tapes. This means that natural environmental noises such as crowing roosters or bleating goats are sometimes heard, as are sounds of traffic. Although such interference has been kept to a minimum, the overriding criterion has been to choose a text that is “good” in terms of its content and linguistic features, and not to worry about the occasional truck, goat, or rooster.
Preparation of Texts for Data Entry
Digitization of the original field recordings was done in Sofia. Once the audio files which form the basis of the present compendium were selected, they were transcribed according to principles described at the end of this page. Recordings made after 1990 purposefully include the entire conversational context, and the transcriptions include everything on the tape – not only speech but also non-verbal cues such as laughter, coughing, and onomatopoetic sounds. Only background conversation which is not relevant to the conversational interaction has been excluded. Recordings made prior to 1990 include only speech by the informant. In these instances, the investigator’s participation in the conversation has been reconstructed and included in the transcriptions; these utterances are always enclosed in brackets to emphasize the fact that they are reconstructions.
Each text is named after the location in which it was recorded, and is divided into lines, which are numbered for the purposes of data retrieval. Each new turn in the conversation occupies a separate line. Longer segments by the same speaker occupy continuous stretches of lines; wherever possible, line breaks correspond to natural rhythmic or syntactic breaks. The beginning of each line contains a code identifying the speaker: informants (whose identity is kept anonymous) are identified simply by lower case letters. Investigators, by contrast, are identified by their first and last initials, and their full names are given in the text’s metadata. The English translations of the text aim to render not only the content but also the style of the original, to the extent possible. Literal translations of most words are given in the interlinear glosses.
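The speaker-code convention just described can be sketched as follows. This is only an illustration: the `TextLine` record, the `speaker_role` function, and the assumption that investigator codes are exactly two upper case letters are hypothetical and do not describe how the site stores its data.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    number: int    # line number used for data retrieval
    speaker: str   # speaker code found at the beginning of the line
    text: str      # transcribed utterance


def speaker_role(code: str) -> str:
    """Classify a speaker code under the assumed convention.

    Lower case letters mark an (anonymous) informant; two upper case
    initials mark an investigator. Anything else is left unclassified.
    """
    if code.isalpha() and code.islower():
        return "informant"
    if code.isalpha() and code.isupper() and len(code) == 2:
        return "investigator"
    return "unknown"


# Placeholder lines; neither the codes nor the text come from the site.
lines = [
    TextLine(1, "XY", "[reconstructed question]"),
    TextLine(2, "a", "first part of the answer"),
    TextLine(3, "a", "continuation by the same speaker"),
]
roles = [speaker_role(line.speaker) for line in lines]
```

Under these assumptions `roles` pairs the first line with an investigator and the next two with an informant.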
Each text appears in three "views". The Glossed view includes the English translation of the text, the spoken Bulgarian in Latin transcription (described under Principles of Transcription below), and the several glosses of each token. The Line view includes only the transcribed Bulgarian text and its English translation, and is intended for those who wish to read the text for content only. The Cyrillic line view includes only the text, transcribed according to accepted Bulgarian transcription conventions.
Each token in a text has its own page, which specifies all the tags assigned to that token, and identifies all other instances of that token on the site. Each line in a text also has its own page, listing all the relevant data associated with that line. Token pages are accessed by clicking on a token; line pages are accessed by clicking on the line in Line view.
Each village name on the home page is associated with a Location page; this contains a map locating the village in question, as well as a detailed description of the dialect of that village, with cross-referenced examples drawn from texts on the site. For each such description, there is a link to a Bulgarian version with the same content.
Annotation of Texts for Data Retrieval
The site has two main goals: to present unedited Bulgarian dialect speech in its natural form (transcribed and translated so as also to be accessible to those outside Bulgaria), and to provide a research tool for scholars, to help them in mining the rich data source which these texts provide. Consequently, the texts were annotated at five different levels in order to allow data retrieval at each level.
• At the Wordform level, each token is provided with tags identifying major grammatical features such as case, number, gender, tense, aspect or definiteness. Also noted are a number of pragmatic or discourse features such as exclamation, backchannelling, and the like. Each token with lexical meaning is also provided with an English gloss. All these tags are in the Latin alphabet, and appear underneath each token in Glossed text view. The site can then be searched for words bearing any combination of these tags.
• At the Lexeme level, each token is provided with a tag correlating it with the relevant dictionary form (or lemma) from Standard Bulgarian; these tags are in the Cyrillic alphabet. Tokens for which there was no corresponding standard lemma are provided with a "dialectal lexeme", a form created by the investigators according to standard dictionary conventions (indefinite for nouns, masculine indefinite for adjectives, 1st singular present for verbs). These tags appear in Glossed text view, underneath the Wordform tags of each token.
• At the Linguistic trait level, certain tokens were provided with one or more tags identifying traits of interest for more detailed linguistic analysis. Most are descriptive, but some concern the synchronic development of historical Slavic vowels, consonants or sequences. The actual tags are in shorthand, and the tags assigned to any one token can be found by clicking on any one token to retrieve its Token page; clicking on the shorthand tag itself will retrieve the full definition. Full definitions of the tags are given at the search level; these searches allow researchers to retrieve considerable amounts of valuable information.
• At the Thematic content level, each line of text was provided with one or more tags identifying the topic of conversation. By this means, those interested in the ethnographic content of the site’s texts can retrieve chunks of lines containing speech referring to specific topics. Thematic tags which have been assigned to any one line appear at the bottom of the relevant Line page.
• At the Phrase level, tags are assigned neither to tokens nor to lines, but rather to "phrases" – grammatically significant sequences of words, the meaning of which is impossible to tag at the level of an individual token. The tags themselves are identified at the search level; the individual phrases identified for any one line can be found on the relevant Line page.
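To make the idea of searching for "any combination of tags" concrete, here is a minimal sketch of token-level annotation and retrieval. The `Token` record, its field names, the tag labels (`noun`, `sg`, `def`, `pl`) and the sample forms are all invented for illustration; the site's actual tag inventory and storage format are not described here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Token:
    form: str                      # wordform in Latin transcription
    lexeme: str                    # Lexeme-level tag (Cyrillic lemma)
    wordform_tags: FrozenSet[str]  # Wordform-level tags (hypothetical labels)
    gloss: str = ""                # English gloss, if the token has lexical meaning
    trait_tags: FrozenSet[str] = frozenset()  # Linguistic-trait shorthand tags


def search_wordforms(tokens: Iterable[Token], required_tags: Iterable[str]) -> List[Token]:
    """Return the tokens whose Wordform tags include every required tag."""
    wanted = frozenset(required_tags)
    return [t for t in tokens if wanted <= t.wordform_tags]


# Invented sample data, not taken from the site.
sample = [
    Token("vodata", "вода", frozenset({"noun", "sg", "def"}), "water"),
    Token("vodi", "вода", frozenset({"noun", "pl"}), "water"),
]
hits = search_wordforms(sample, {"noun", "sg"})
```

With this sample, the query for `noun` plus `sg` matches only the first token, while a query for `noun` alone matches both. The same subset test could be applied to line-level thematic tags.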
Principles of Transcription
Because one of the site’s major goals is to make Bulgarian dialectal data available to the broader public outside Bulgaria, the primary transcription system is based on the Latin alphabet. All texts have also been transcribed in Cyrillic, according to the accepted system used by Bulgarian dialectologists.
The primary transcription system is a combination of symbols adapted to the specific requirements of Bulgarian dialectal data. Certain symbols are taken from the International Phonetic Alphabet (IPA) and others from the academic transliteration that is the norm among Slavists. Where phonetic precision is needed in order to render important dialectal distinctions, IPA symbols are used. Elsewhere, simplified forms are used in order to make the transcription more accessible to non-phoneticians. The following symbols are used on the site:
Consonants
Latin | (Comment) | Cyrillic | (Коментар) |
---|---|---|---|
b | voiced bilabial stop | б | |
c | voiceless alveolar affricate | ц | |
č | voiceless postalveolar affricate | ч | |
d | voiced alveolar stop | д | |
dz | voiced alveolar affricate | ѕ | звучен алвеодентален африкат |
dž | voiced postalveolar affricate | џ | звучен преднонебен африкат |
f | voiceless labiodental fricative | ф | |
ɸ | voiceless bilabial fricative | φ | билабиално /ф/ |
g | voiced velar stop | г | |
h | laryngeal fricative | h | ларингално /х/ |
j | palatal approximant | й | |
k | voiceless velar stop | к | |
l | alveolar lateral | л | |
m | bilabial nasal | м | |
n | alveolar nasal | н | |
p | voiceless bilabial stop | п | |
r | alveolar trill | р | |
s | voiceless alveolar fricative | с | |
š | voiceless postalveolar fricative | ш | |
t | voiceless alveolar stop | т | |
v | voiced labiodental fricative | в | |
β | voiced bilabial fricative | w | билабиално /в/ |
w | labiovelar approximant | ў | консонантно /у/ |
x | voiceless velar fricative | х | |
ž | voiced postalveolar fricative | ж | |
Vowels
Latin | (Comment) | Cyrillic | (Коментар) |
---|---|---|---|
a | central open unrounded | a | |
ɑ | back open unrounded | а̊ | задна отворена нелабиална |
e | front mid-closed unrounded | е | |
ɛ | front mid-open unrounded | е̂ | широко /е/ |
e̝ | raised e | е̇ | тясно /е/ |
i | front closed unrounded | и | |
ɨ | central closed unrounded | ɨ | ери /ы/ |
y | front closed rounded | ӥ | лабиализувано /и/ |
о | back mid-closed rounded | о | |
ɔ | back mid-open rounded | о̂ | широко /о/ |
o̝ | raised o | о̇ | тясно /о/ |
u | back closed rounded | у | |
ɤ | central mid-closed unrounded | ъ | |
ʌ | back mid-open unrounded | ъ̂ | широко /ъ/ |
ɯ | back closed unrounded | ъ̴ | веларно /ъ/ |
ə | mid-central unrounded | ə | шва (/ə/) |
ə̝ | above, but fronted | е̥ | Милетичево /е/ |
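The correspondences in the two tables can be read as a lookup from Latin symbols to Cyrillic ones. The sketch below converts a Latin-transcribed string using a small subset of the rows above. The function name `to_cyrillic`, the choice of rows, and the greedy longest-match treatment of the digraphs `dz` and `dž` are assumptions made for illustration, not a description of how the site's Cyrillic transcriptions were produced.

```python
import unicodedata

# Subset of the tables above: Latin symbol -> Cyrillic symbol.
_PAIRS = {
    "b": "б", "c": "ц", "č": "ч", "d": "д", "dz": "ѕ", "dž": "џ",
    "g": "г", "k": "к", "l": "л", "m": "м", "n": "н", "p": "п",
    "r": "р", "s": "с", "š": "ш", "t": "т", "v": "в", "ž": "ж",
    "u": "у", "ɤ": "ъ",
}
# Normalize keys so precomposed and decomposed diacritics compare equal.
LATIN_TO_CYRILLIC = {unicodedata.normalize("NFC", k): v for k, v in _PAIRS.items()}
# Longest keys first, so that "dž" is tried before "d".
_KEYS = sorted(LATIN_TO_CYRILLIC, key=len, reverse=True)


def to_cyrillic(text: str) -> str:
    """Convert Latin transcription symbols to Cyrillic, longest match first.

    Symbols that are not in the lookup table are copied unchanged.
    """
    text = unicodedata.normalize("NFC", text)
    out = []
    i = 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                out.append(LATIN_TO_CYRILLIC[key])
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Greedy matching always reads `d` followed by `ž` as the affricate; a sequence that should be kept as two separate segments would need extra handling. Symbols built with combining diacritics, such as the raised vowels, could be added to the table in the same way.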