Selection of Texts
The material recorded by the BDLT team totals some 200 hours. From this material, segments were selected to be transcribed as texts and included in the database. Two criteria guided the selection of texts (made primarily by Zhobov). First, each text should illustrate as many of the salient features of the local dialect as possible; second, each text should constitute a well-formed instance of discourse and give some insight into village life. Several texts are narratives, either of a folktale or of a personal experience. Most, however, consist of conversations about the way things were done in “the old days”. As is common in ethnographic work, turning the conversation in this direction is an effective way not only to garner valuable ethnographic information (about holiday customs, agricultural practices, food preparation, and the like) but also to direct the speaker’s attention away from the present (and thus to minimize the influence of the standard language).
Most texts concentrate on the speech of a single informant. In some instances, the voices of other speakers can also be heard. Usually they are in the background, although in some cases there are two (and, rarely, three) speakers in dialogue with one another. In certain instances these segments were chosen deliberately because the topic and the liveliness of the conversation provided especially vivid examples of natural dialectal speech.
In keeping with the project goal of presenting natural speech, there has been minimal editing of the tapes. This means that natural environmental noises such as crowing roosters or bleating goats are sometimes heard, as are sounds of traffic. Although such interference has been kept to a minimum, the overriding criterion has been to choose a text that is “good” in terms of its content and linguistic features, and not to worry about the occasional truck, goat, or rooster.
Preparation of Texts for Data Entry
Digitization of the original field recordings was done in Sofia. Once the audio files which form the basis of the present compendium were selected, they were transcribed according to principles described at the end of this page. Recordings made after 1990 purposefully include the entire conversational context, and the transcriptions include everything on the tape – not only speech but also non-verbal cues such as laughter, coughing, and onomatopoetic sounds. Only background conversation which is not relevant to the conversational interaction has been excluded. Recordings made prior to 1990 include only speech by the informant. In these instances, the investigator’s participation in the conversation has been reconstructed and included in the transcriptions; these utterances are always enclosed in brackets to emphasize the fact that they are reconstructions. Each text is named after the location in which it was recorded.
The texts are divided into lines, which are numbered for the purposes of data retrieval. Each new turn in the conversation occupies a separate line. Longer segments by the same speaker occupy continuous stretches of lines; wherever possible, line breaks correspond to natural rhythmic or syntactic breaks. Each speaker is identified at the beginning of each line. The identities of informants are kept anonymous; they are identified simply by lower case letters. Investigators are identified by their first and last initials, and their full names are given in the text’s metadata. The English translations of the texts aim to render not only the content but also the style of the original, to the extent possible. Literal translations of most words are given in the interlinear glosses.
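The line conventions just described can be pictured with a small sketch. It is only an illustration, not the site's actual data format: the class name `Line`, its field names, the investigator initials "XY", the placeholder utterances, and the choice of square brackets as the bracket type are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    number: int   # line number, used for data retrieval
    speaker: str  # "a", "b", ... for informants; initials for investigators
    text: str     # transcribed utterance (placeholder strings below)

    @property
    def is_informant(self) -> bool:
        # Informants are anonymous and identified by one lower-case letter.
        return len(self.speaker) == 1 and self.speaker.islower()

    @property
    def is_reconstructed(self) -> bool:
        # Assumption: reconstructed investigator utterances use square brackets.
        return self.text.startswith("[") and self.text.endswith("]")


def number_lines(turns):
    """Give consecutive line numbers, starting at 1, to (speaker, text) pairs."""
    return [Line(i, speaker, text) for i, (speaker, text) in enumerate(turns, start=1)]


# Hypothetical two-line text: a reconstructed question and an informant's reply.
lines = number_lines([
    ("XY", "[placeholder question]"),
    ("a", "placeholder answer"),
])
```

In this sketch a single speaker's longer turn would simply appear as several consecutive `Line` objects with the same `speaker` value.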
Each Location page contains a description of the dialect of that village, with cross-referenced examples drawn from texts on the site.
Annotation of Texts for Data Retrieval
The site has two main goals: to present unedited Bulgarian dialect speech in its natural form (transcribed and translated so as also to be accessible to those outside Bulgaria), and to provide a research tool for scholars, to help them in mining the rich data source which these texts provide. Consequently, the texts were annotated at five different levels so as to allow data retrieval at each level.
• At the Wordform level, each token was provided with tags identifying major grammatical features such as case, number, gender, tense, aspect or definiteness, as well as a number of pragmatic or discourse features such as exclamation, backchannelling, and the like. Each token with lexical meaning was also provided with an English gloss. All these tags are in the Latin alphabet. These tags appear underneath each token in Glossed text view.
• At the Lexeme level, each token was provided with a tag correlating it with the relevant dictionary form (or lemma) from Standard Bulgarian; these tags are in the Cyrillic alphabet. Tokens for which there was no corresponding standard lemma were provided with a "dialectal lexeme", a form created by the investigators and given according to standard dictionary conventions (indefinite for nouns, masculine indefinite for adjectives, 1st singular present for verbs). These tags also appear in Glossed text view, underneath the Wordform tags of each token.
• At the Linguistic trait level, certain tokens were provided with one or more tags identifying traits of interest for more detailed linguistic analysis. Most are descriptive, but some concern the synchronic development of historical Slavic vowels, consonants or sequences. The actual tags are in shorthand. The tags assigned to a token can be found by clicking on that token to retrieve its Token page; clicking on a shorthand tag will retrieve its full definition. At the search level, full definitions of the tags are given.
• At the Thematic content level, each line of text was provided with one or more tags identifying the topic of conversation. The chunks of lines to which these tags are assigned appear as a result of any one search; the individual tags assigned to any one line can be found on the Line page.
• At the Phrase level, a new component was set up to identify grammatically significant sequences of words, to which one or more identifying tags were applied. The tags themselves are identified at the search level; the individual phrases identified for any one line can be found on the relevant Line page.
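One way to picture how the five annotation levels fit together is the following minimal sketch. It is not the site's actual schema: the class names, field names, search functions, and every tag value (such as "trait-x" or "theme-a") are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str                                     # transcribed wordform
    grammar: list = field(default_factory=list)   # Wordform level: grammatical and discourse tags
    gloss: str = ""                               # Wordform level: English gloss
    lemma: str = ""                               # Lexeme level: standard or dialectal lexeme
    traits: list = field(default_factory=list)    # Linguistic trait level: shorthand tags


@dataclass
class Phrase:
    first: int   # index of the first token of the phrase within the line
    last: int    # index of the last token (inclusive)
    tags: list   # Phrase level: identifying tags


@dataclass
class AnnotatedLine:
    number: int
    tokens: list
    themes: list = field(default_factory=list)    # Thematic content level
    phrases: list = field(default_factory=list)   # Phrase level


def find_tokens(lines, lemma=None, trait=None):
    """Return (line number, wordform) pairs matching every criterion given."""
    hits = []
    for line in lines:
        for tok in line.tokens:
            if lemma is not None and tok.lemma != lemma:
                continue
            if trait is not None and trait not in tok.traits:
                continue
            hits.append((line.number, tok.form))
    return hits


def find_lines(lines, theme):
    """Return the numbers of lines tagged with the given thematic tag."""
    return [line.number for line in lines if theme in line.themes]


# Placeholder data only; no real dialect forms or tags are shown.
sample = [
    AnnotatedLine(
        number=1,
        tokens=[
            Token("form1", ["noun", "sg"], "gloss1", "LEMMA1", ["trait-x"]),
            Token("form2", ["verb", "pres"], "gloss2", "LEMMA2"),
        ],
        themes=["theme-a"],
        phrases=[Phrase(0, 1, ["phrase-tag"])],
    ),
    AnnotatedLine(number=2, tokens=[Token("form3", lemma="LEMMA1")], themes=["theme-b"]),
]
```

With this layout, a lexeme search such as `find_tokens(sample, lemma="LEMMA1")` gathers every wordform sharing a dictionary form, while `find_lines(sample, "theme-b")` works at the level of whole lines, mirroring the distinction between token-level and line-level tags.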
Principles of Transcription
Because one of the site’s major goals is to make Bulgarian dialectal data available to the broader public outside Bulgaria, the primary transcription system is based on the Latin alphabet; all texts have also been transcribed in Cyrillic, according to the accepted system used by Bulgarian dialectologists.
The primary transcription system is a combination of symbols adapted to the specific requirements of Bulgarian dialectal data. Certain symbols are taken from the International Phonetic Alphabet (IPA) and others from the academic transliteration that is the norm among Slavists. Where phonetic precision is needed in order to render important dialectal distinctions, IPA symbols are used. Elsewhere, simplified forms are used in order to make the transcription more accessible to non-phoneticians. The following symbols are used on the site:
|voiced bilabial stop
|voiceless alveolar affricate
|voiceless postalveolar affricate
|voiced alveolar stop
|voiced alveolar affricate
|voiced alveodental affricate
|voiced postalveolar affricate
|voiced prepalatal affricate
|voiceless labiodental fricative
|voiceless bilabial fricative
|voiced velar stop
|voiceless velar stop
|voiceless bilabial stop
|voiceless alveolar fricative
|voiceless postalveolar fricative
|voiceless alveolar stop
|voiced labiodental fricative
|voiced bilabial fricative
|voiceless velar fricative
|voiced postalveolar fricative
|central open unrounded
|back open unrounded
|approximately like a labialized /a/
|front mid-closed unrounded
|front mid-open unrounded
|front closed unrounded
|central closed unrounded
|front closed rounded
|back mid-closed rounded
|back mid-open rounded
|back closed rounded
|central mid-closed unrounded
|back mid-open unrounded
|back closed unrounded
|above, but fronted