hebrewCorpus

search tool for students and scholars

Instructions
click on instructions link in search bar to access these instructions at any time

click on dark red text to expand and collapse information
General Information about hebrewCorpus
  • HebrewCorpus, which contains over 150 million words, was developed by The National Middle East Language Resource Center (NMELRC). More information about NMELRC is available here. HebrewCorpus is intended for scholars, teachers, and students of the Hebrew language. It gives the user the ability to:
    • Search subcorpora either separately or in combination.
    • Search for words alone or in combination, including phrases.
    • Choose from POS filters. This corpus is not tagged for part of speech, but it looks at affixes and word structure to attempt to predict what part of speech a word is. This still leaves some ambiguity, but it saves a lot of time.
    • Use most "regular expression" language, which serves to narrow searches considerably.
Instructions on Searching hebrewCorpus
Results
  • Basic Information about Results
    • Most results come back after a few seconds' wait. Less common single words return results in about 10 seconds. If the word is common or the subcorpus is large, it may take from 30 seconds to a minute to get the results (shorter common words searched as a string can sometimes take longer). For phrases, the amount of time varies.
    • If you search for something extremely common (such as an ayin by itself or multiple common words), the program will sometimes abort that specific search, and go to a blank screen. A trick that often works for this is to hit the back button on your browser; this will often display the search results.
    • As can be expected, the wait time is longer if you search all genres at once. In some cases the wait time is only slightly longer, and in other cases it is doubled or tripled.
    • One good way to save your results is to copy what you want from the corpus, open an Excel document, select the rows you want to paste the information into, and go to Paste Special. Under Paste Special, choose to paste as Unicode text. This will create organized rows that you can save for later viewing or side-by-side comparison.
  • Summary Page
    • This page automatically comes up first after a search, and you can return to it at any time by clicking on 'summary'.
    • This page gives the following summary information about your search:
      • the word you searched for in transliterated characters
      • the search string you typed in, in both scripts; this may include extra search strings that the program uses to try to predict alternations, as under 'verb'
      • the subcorpus or database that you searched
      • the time it took the search engine to perform the search (this will generally be less than the actual time experienced by you, since it does not include the time it takes for the server to receive your request or to serve the results back to you).
      • the POS filter you chose
      • the POS filter that was actually used
      • the number of occurrences or "hits"
      • what that number translates to in terms of occurrences per 100,000 words of the subcorpus
    • NOTE: The latter bit of information is useful for comparative purposes. The subcorpora are of vastly different sizes, so the actual number per subcorpus may be misleading; the number per 100,000 words is more easily compared.
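    • The per-100,000 figure can be reproduced by hand. The following is a minimal sketch, not the site's own code; the function name and the subcorpus sizes are made up for illustration.

```python
def per_100k(hits: int, subcorpus_words: int) -> float:
    """Occurrences per 100,000 words of the subcorpus."""
    return hits * 100_000 / subcorpus_words

# Hypothetical counts and sizes, chosen only to illustrate the comparison.
small = per_100k(hits=50, subcorpus_words=200_000)       # 25.0
large = per_100k(hits=300, subcorpus_words=6_000_000)    # 5.0

# The larger subcorpus has more raw hits, but the word is relatively
# more frequent in the smaller one.
print(small, large)
```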
  • Citations Page
    • Click on 'citations' in the dark blue bar to see the 10 words before and the 10 words after the "hits".
    • By default, these citations are sorted by the word that appears directly before the word you searched for.
      • Note that this sorting is done by the whole word, not by the root, so spr ספר, hspr הספר, and whspr והספר are nowhere near each other.
      • The word directly before is repeated at the left-hand side of the page under 'sort word' so you can quickly glance through the citations and notice patterns, collocations, etc.
    • If you would rather see the citations sorted by the word directly after the "hits", click on the sentence near the top left of the page that reads 'sort by word after'.
    • The citations are shown 100 at a time.
      • If your search returns more than 100 citations, they will be organized into pages which you can access by clicking on the dark red numbers at the top.
    • The 'subsection' column indicates which subsection of the subcorpus the example comes from; e.g. NEWS, etc.
    • If you want to see the exact reference for the citation, click on the 'subsection' heading and it will change to 'reference'.
      • The references give information about the specific source; most are helpful although some are numbered and basically incomprehensible to the user.
    • If you want to see more context than 10 words before and 10 words after, click on the number to the left of the citation.
      • This will bring up a separate window that displays a paragraph of surrounding context from the source it comes from (or the whole citation if there are no copyright issues).
      • You may have to use your browser's search function (command + f on a Mac or control + f on a PC) and Hebrew script to find the word or expression you searched for, since it is not highlighted already.
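    • The citation layout described above (a window of words around each "hit", sorted by the word before or after) can be sketched as follows. This is an illustration under assumptions, not the site's implementation: the function name is hypothetical and the transliterated token list is made up.

```python
def citations(tokens, target, window=10):
    """Return (sort_word, before, hit, after) for each occurrence of target."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok == target:
            before = tokens[max(0, i - window):i]
            after = tokens[i + 1:i + 1 + window]
            sort_word = before[-1] if before else ""
            rows.append((sort_word, before, tok, after))
    return rows

# Made-up token sequence for illustration only.
tokens = "ktbty spr Twb Abl qrAty spr ArwK".split()
rows = citations(tokens, "spr")

# Default view: sorted by the whole word directly before the hit.
by_before = sorted(rows, key=lambda r: r[0])
# Alternative view: sorted by the word directly after the hit.
by_after = sorted(rows, key=lambda r: r[3][0] if r[3] else "")
```

    • Because the sort key is the whole word as a plain string, prefixed forms of the same root end up far apart, as noted above.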
  • Subsections Page
    • Click on 'subsections' to see the total occurrences for that word or expression in the various subsections of the subcorpus you searched. This can be useful to see where a word is generally used, such as kdwr כדור in SPOR. If you search using one of the combined subcorpora, 'subsections' will list the subcorpora rather than individual subsections.
    • The results in this section are ordered from most frequent to least frequent in terms of the number of occurrences.
    • The frequency per 100,000 for each subsection is also given; this is valuable because each subsection is of a different size, and this puts the word or expression into proper perspective for better comparison.
  • Word Forms Page
    • Click on 'word forms' to see the exact forms that your search found, ordered by frequency.
    • Click on any word in the word form list and a separate window will open with the citations for only that word form.
    • Examine this list to get important hints about normal usage.
    • Examine this list to identify what kinds of false "hits" you are getting so you can work to eliminate them, either by changing the POS filter, using regular expressions, or cutting them out by hand.
    • NOTE: After every search, it is strongly suggested that you examine this list before you take the rest of the results seriously.
      • See if there are any forms you expected to find that are not there.
      • See if there are any forms you did not expect to find that are there.
      • See if you can figure out what it is about the expression you typed in (and about Hebrew morphology) that would have created any problems you encountered.
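    • The checks suggested above amount to a frequency count plus two set comparisons. Here is a small sketch using made-up transliterated hits; the list of expected forms is whatever you anticipated before searching.

```python
from collections import Counter

# Made-up list of matched forms, one entry per "hit".
hits = ["hmwH", "mwH", "bmwH", "hmwH", "mwH", "hmwH", "mwHy"]

# Word forms ordered by frequency, as on the 'word forms' page.
forms = Counter(hits).most_common()

expected = {"mwH", "hmwH", "lmwH"}
found = {form for form, _ in forms}
missing = expected - found       # forms you expected but did not get
unexpected = found - expected    # forms to inspect as possible false hits
```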
  • Words Before/After Page
    • Click on 'words before/after' to see a list of the common words directly before and directly after the "hit", ordered by frequency.
    • Examine these lists to scope out the main usages and collocations of the word you searched for.
    • Examine these lists to identify structures that you had not intended to search for that you want to cut out.
    • Click on a word in the list to see the citations with only that particular word before or after.
      • A separate window will open showing you just those citations.
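    • A rough sketch of how such before/after lists can be built from a token sequence follows. The function name and the tokens are hypothetical; the site's own method is not documented here.

```python
from collections import Counter

def neighbours(tokens, target):
    """Count the words directly before and directly after each hit."""
    before, after = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        if i > 0:
            before[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            after[tokens[i + 1]] += 1
    return before, after

# Made-up token sequence for illustration only.
tokens = "Sl mwH hwA Sl mwH gdwl Al mwH".split()
before, after = neighbours(tokens, "mwH")
print(before.most_common())
print(after.most_common())
```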
About Each Subcorpus
  • Click the following link to download a rich-text document with a comprehensive list of word totals for the subcorpora (and their subsections) found on the site:
  • Download Document with Word Counts

  • About each subcorpus, in order of appearance (*according to Wikipedia):
    • Arutz 7 is a written news website for an Israeli media network, which is identified with Religious Zionism and as the voice of the Israeli settlement movement.* This resource is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
    • Beginning Newspapers is made up of two subsections: Yanshuf and ShaarLaMatchil. Both of these newspapers contain easy Hebrew. Although a limited amount of material from them is available here, it is a great resource to study voweling. This resource is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
    • Erev Erev in Eilat is the first local newspaper in Israel. It was established in 1962.
    • Haaretz is Israel’s oldest daily newspaper. It was founded in 1918. It is said to be Israel’s most influential newspaper and is mostly read by the intelligentsia and the political and economic elites of Israel. It is described as liberal and left wing.* The subcorpus of Haaretz that is from the years 1990-91 is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
    • Maariv is a daily tabloid published in Israel. It is said to give a balanced representation of the diverse views that abound in Israeli society.*
    • Raanana Shelanu is a local newspaper for the city of the same name in the central part of Israel.
    • TheMarker is an economic news website that offers news on hi-tech business, advertising, media, real estate, labor market, law, automobiles and transportation.* This resource is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
    • Ynet is one of the most popular Israel news websites. It is owned and operated by Yediot Ahronot, but most of its content is original and published only on the website. In 2008, it was Israel’s most popular internet portal.*
    • The Tanach is the Hebrew Bible, broken into the subsections Torah, Neviim, and Ketuvim, from which the acronym TaNaKh comes.*
    • The Mishnah is the first major work of Rabbinic Judaism, and the first major written redaction of the Jewish oral traditions.* This text is provided courtesy of The Structured Mishnah, which is available online here.
    • Early Fiction is a collection of fiction from 30 early Hebrew authors who lived during the Haskalah and Hebrew revival. These texts are provided courtesy of Project Ben-Yehuda, which is an internet site devoted to making works in the public domain freely available. To visit this site, click here.
    • Modern Fiction-Orig is a collection of fiction written originally in Hebrew, as opposed to translated into Hebrew. Works are not included in their entirety but as excerpts from several books. It is divided into subsections by year of publication.
    • Modern Fiction-Tran is a collection of fiction translated into Hebrew, as opposed to written originally in Hebrew. It includes works from 24 different languages, which are separated into subsections.
    • Movies is a compilation of subtitles from 59 movies that have come out in the last forty-five years or so. These movies are divided into subsections by genre.
    • Spoken is a group of transcribed texts, from six distinct conversations (which coincide with the subsections): a meeting at a high-tech plant discussing personnel issues, a son telling his father about a trip he made to China and Mongolia, an informal car drive back from a wedding, a family going to have dinner and having it, a boy talking with his girlfriend, and a soldier and his commander discussing personal issues. Spoken is provided courtesy of CoSIH, or The Corpus of Spoken Israeli Hebrew. To read more about this project, click here.
    • Tapuz Forums is a collection of largely informal discussion that comes from Tapuz, an Israeli web portal.* This resource is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
    • Journals is a large collection of specialized periodicals and journals on a variety of topics, largely for an academic audience. Among the topics discussed are medicine, Jewish life, law, education, and society.
    • Knesset is a collection of sessions from the Knesset, which is the legislature of Israel. This resource is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
    • Wikibooks is a collection of free-content textbooks and annotated texts that anyone can edit.* It is part of the Wikimedia Foundation, and is used here under the GNU Free Documentation License. To view the site where the original articles are found, click here.
    • Wikinews is a collection of free-content news that anyone can edit. It differs from Wikipedia in that it is written in the format of news stories as opposed to encyclopedia articles.* It is part of the Wikimedia Foundation, and is used here under the GNU Free Documentation License. To view the site where the original articles are found, click here.
    • Wikipedia is a collection of free-content encyclopedia articles that can be edited by anyone.* It is part of the Wikimedia Foundation, and is used here under the GNU Free Documentation License. To view the site where the original articles are found, click here.
    • Wikiquote is a collection of free-content quotations from prominent people, books, films, and proverbs that can be edited even by visitors to the site.* It is part of the Wikimedia Foundation, and is used here under the GNU Free Documentation License. To view the site where the original articles are found, click here.
    • Wikisource is a collection of free-content textual sources. Its library includes novels, non-fiction works, letters, speeches, constitutional and historical documents, laws, and a range of other documents.* It is part of the Wikimedia Foundation, and is used here under the GNU Free Documentation License. To view the site where the original articles are found, click here.
  • A number of the newspaper subcorpora have four-letter abbreviations for their subsections. These are mostly self-explanatory, but for clarification they stand for the following:
    • CULT: Culture (culture and the arts, etc.)
    • INFO: Information (health, computers, help, encyclopedia, etc.)
    • LAWS: Laws (crime, investigations, law, etc.)
    • LIFE: Lifestyle (tourism, relationships, entertainment, food, etc.)
    • MISC: Miscellaneous (opinions, new age, letters to the editor, etc.)
    • NEWS: News (headlines, international, regional, etc.)
    • RPTS: Reports (various reports)
    • SPOR: Sports (various sports articles)
Announcements
  • 27 August 2010: The tutorial is now all available on the site, which will help users quickly learn the program and its uses.
  • 22 July 2010: Part of the tutorial is now available online; more will be added soon. A group mailing list has also been created for those interested in staying informed about updates and helpful hints.
  • 14 July 2010: Several more subcorpora have been added to hebrewCorpus, as well as a number of combined subcorpora.
  • 20 June 2010: An advanced search has been added to the corpus that allows the user to search for words with vowels. See 'advanced transliteration help' for a key to the letters used for the vowels.
  • 12 January 2010: The URL for hebrewCorpus has been changed from BYU to NMELRC.
  • 7 December 2009: A rough version of the instructions to guide the use of hebrewCorpus has been posted. Feel free to e-mail me with any additional questions or suggestions. Also, a tutorial to complement the instructions is being constructed—check back later for updates.
  • 3 December 2009: The display of English text was fixed; it no longer appears with brackets around it or jumbled together with Hebrew.
  • 20 November 2009: All of the subcorpora were combined into an additional searchable subcorpus called 'All Newspapers'.
  • 11 November 2009: The corpus is being worked on to rid it of strange characters that aren't supposed to show up in the corpus (such as â¢, ¼Ö, and unconverted English characters).
  • 19 October 2009: The corpus has been put online through BYU.
  • 27 August 2009: Work has been started to convert the corpus data and create a beta version.
Tutorial
  • Introductory Note
    • The tutorial can complement your knowledge of the instructions, or it can be an effective way to quickly learn the uses of the program by getting hands-on practice. In each of these searches, it is assumed that you will use the 'latin chars' box in your searches, but the Hebrew is also provided in case you would prefer that. Good luck!
  • First Search: Looking for a single word
    • First, click 'instructions' to the right of the subcorpus drop-down box to pull up this tutorial in a separate window, so you can continue to read it as you search.
    • Click on 'transliteration help' to see a chart of the transliteration system in a separate window.
    • Type mwH מוח into the 'latin chars' box.
    • Choose 'noun' from the 'part of speech' drop-down box.
    • Choose 'Journals' from the subcorpus drop-down box.
    • Click on 'submit'.
    • Wait about 15 seconds, since bigger subcorpora take longer to search.
    • Examine the 'summary of search results', and note the total number of occurrences, and how frequent this is in the subcorpus per 100,000 words (which is located right below the number of occurrences). The latter statistic is important, because it helps you to compare the subcorpora of different sizes.
    • Click on 'citations' in the dark blue bar.
    • As you can see, each example gives you the word in context with 10 words before and 10 after.
    • Scroll down and look at a few of the citations. You will notice that the citations are organized by the word before; to make this easier to see, the word before is also shown under 'sort word'. Scroll back to the top.
    • Note that there are 44 pages of results. Each page shows up to 100 citations, so the 4,347 citations fill 44 pages.
    • Click on one of the numbers at the left, and a new window will open with even more context, if there is more in the surrounding paragraph.
    • Click on 'sort by word after' to sort the citations by the word after instead.
    • Next, click on 'subsections' in the dark blue bar.
    • Study the number of occurrences and the relative frequencies of this word in the various sections of Journals. MED:Neurology has by far the most occurrences, which can be expected given the subject matter. Note that even though this page is sorted by total number of occurrences, the frequency is what really shows which subsection uses the word the most. For example, even though EDU:Biology has more than three times as many occurrences of mwH as MED:Cardiology, the relative frequency shows that MED:Cardiology actually uses the word almost three times as often as EDU:Biology.
    • Now click on 'word forms' in the dark blue bar.
    • Notice the different forms in which this word is found: with the definite article, by itself, with prepositional prefixes, with pronoun endings, etc.
    • Click on במוח to see the citations for just that word form in a separate tab. This is a good way to refine your search to only the results you want to see.
    • Click on 'words before/after' in the dark blue bar.
    • Examine the most common words that come before our search word and the most common words that come after. This is a great way to see collocates and figure out common idioms. Is this what you would have predicted?
    • Click on שבץ to see 413 citations of it in a new window.
    • Click on 'summary' in the dark blue bar to go back to the summary page.
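    • The page count seen in this search follows from the 100-citations-per-page rule. A quick check of the arithmetic (the function name is only for illustration):

```python
import math

PAGE_SIZE = 100  # citations shown per page

def page_count(citations: int, page_size: int = PAGE_SIZE) -> int:
    """Number of result pages needed to show all citations."""
    return math.ceil(citations / page_size)

pages = page_count(4347)                           # 44
on_last_page = 4347 - (pages - 1) * PAGE_SIZE      # 47
print(pages, on_last_page)
```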
  • Understanding the POS Filters
    • Type zqN זקן into the 'latin chars' box, choose the 'adj' POS and the 'ErevErev: 03-09' subcorpus, and click 'submit'.
    • Go through the various options in the dark blue bar as before. Note particularly under 'word forms' that you get examples of both masculine and feminine, singular and plural. You also get the definite article and the 'vav' prefix.
    • Now try searching the same word in the same subcorpus, only with the 'noun' POS chosen. This time under 'word forms' you also get attached prepositions, whereas other results are cut out.
    • Next, type sbbh סבבה into the 'latin chars' box, choose the 'adv' POS and the 'Tapuz Forums' subcorpus. Go through the various options once you get the results. Note that you now only get results for the bare word form and one instance of it with a vav.
    • Type ktb כתב into the 'latin chars' box, choose the 'verb' POS and the 'Raanana' subcorpus, and go through the various options looking at the results. You will notice that the program generated other strings to go with this search: ktb,kwtb,ktwb כתב,כותב,כתוב. This is the system's way of trying to get all of the expected conjugations of a verb.
    • Try other words and run them through each of the POS filters. Then look at word forms, and you will begin to understand the differences between all of the filters.
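    • The extra strings generated for ktb suggest a simple pattern: a vav (w) inserted after the first or the second root letter. The sketch below reproduces only that one example. It is a guess inferred from the ktb case, not the program's actual rule, and the function name is hypothetical.

```python
def verb_strings(root: str) -> list[str]:
    """Guess at the extra search strings for a three-letter root.

    Assumption: the forms are the bare root plus the root with 'w'
    inserted after the first and after the second letter.
    """
    if len(root) != 3:
        return [root]
    c1, c2, c3 = root
    return [root, f"{c1}w{c2}{c3}", f"{c1}{c2}w{c3}"]

print(",".join(verb_strings("ktb")))  # ktb,kwtb,ktwb
```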
  • Searching Subcorpora by Themselves or Together
    • Type qwnsTlcyh קונסטלציה into the 'latin chars' box, choose the 'noun' POS and the 'Haaretz: 08' subcorpus, and click 'submit'.
    • Only six results were found. Go through the various options; you will notice that it is hard to see significant trends because of the small sample size. 'Subsections' only lists three sections, and the relative frequency is very small.
    • Now try the same word and POS, but this time choose >>ALL GENRES<<. This will search all of the subcorpora simultaneously, and since it is so big it will take a while to retrieve the results.
    • About three minutes later, you will see that it found 78 occurrences in the entire corpus. Go through each of the options. Notice particularly that the 'subsections' page does not give the number of occurrences in each subsection, but rather in each subcorpus. This is also the case with the other combined subcorpora. It gives you an idea of which subcorpora this word occurs in, and which it does not (in which case the subcorpus does not show up in the results).
    • Now try this with a new word. Search for Twb טוב as an 'adj' in 'Spoken', and click on 'submit'. This gets 48 occurrences. Go through the options and notice the trends.
    • Search for this word in >>ALL GENRES<<. This word is common enough that it will cause the program to stall after a few minutes and present you with a blank screen. Click the 'back' button on your browser to retrieve the results. Go to subsections and you will see interesting distributions of Twb טוב in the subcorpora. Is this what you would have guessed?
  • Searching for More Than One Word at Once
    • Try searching for yldh ילדה as a 'noun' in 'Journals'. Go to 'word forms'. The plural for this word, yldwt ילדות, is not there since the program does not account for this variant. To overcome this in one search, type yldh,yldwt ילדה,ילדות (with no space) in the 'latin chars' box. Now go to 'word forms', and you will see that both of these words are included in the results. Now try Eyr,EryM עיר,ערים and examine these results.
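    • A comma-separated search behaves much like a regular-expression alternation. The sketch below shows that idea in Python on transliterated text. It is an assumption about the behavior, not the site's code, and unlike the 'noun' filter it matches only the bare words with no prefixes.

```python
import re

def comma_search(query: str) -> re.Pattern:
    """Turn 'a,b' into a whole-word pattern matching a or b."""
    words = [re.escape(w) for w in query.split(",") if w]
    return re.compile(r"\b(?:" + "|".join(words) + r")\b")

pattern = comma_search("yldh,yldwt")
found = pattern.findall("yldh AHt wSty yldwt")
print(found)  # ['yldh', 'yldwt']
```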
  • Searching for Phrases
    • Search for mSA wmtN משא ומתן as a 'string' in 'ErevErev:03-09'. Go to 'words before/after'. These words come before and after the entire phrase that you searched. Try other phrases, as strings and as other parts of speech, then check word forms to see what it produces. An additional example is if you wanted to find rch רצה as in "she runs" rather than "he wanted". To find this you can reduce ambiguity by searching for hyA rch היא רצה.
  • Searching with Regular Expressions
    • Learning to Represent Word Characters
      • Type Any \w\w\wty אני \ו\ו\ותי into the 'latin chars' box as a 'string' in '--All Colloquial--'; \w \ו means any word character. Go to 'word forms'. This will give you all of the instances of Any אני followed by a verb in the past tense 1st person singular form in the three colloquial subcorpora.
      • Try searching for \w?ASr ו?אשר\ as a 'noun' in 'Tanach'. This tells the program to find ASr אשר by itself as well as with any single letter before it, in addition to the noun filtering. This will give you several results, including ASr, kASr, mASr אשר, כאשר, מאשר and other interesting forms.
      • Type hwA\s?\w* rch הוא\ס?\ו* רצה as a 'string' in 'Ynet: 00-09'. This tells the program to find hwA rch הוא רצה, as well as any word that comes between the two, since \w* \ו* matches any number of word characters, including none. \s? ?ס\ means that there can be one or no spaces there, so the program does not require two spaces when nothing comes between the words. With this search, you will find some interesting words between hwA הוא and rch רצה.
      • Finally, type nwrA \w+ +נורא \ו as a 'string' in 'Modern Fiction-Orig: 05-10'. This is telling the program to find one or more instances of a word character after nwrA נורא and a space. Look through the different words that come after nwrA נורא here under 'word forms'.
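      • The same four patterns can be tried outside the corpus. The sketch below runs them with Python's re module on made-up transliterated text. Python's regular expressions are assumed to behave like the site's for these basic features, and no POS filtering is applied.

```python
import re

# Made-up transliterated sentence, for illustration only.
text = "Any ktbty mktb whwA lA rch wgM hwA rch nwrA yph"

# \w matches exactly one word character, so \w\w\wty needs three before 'ty'.
first_person = re.findall(r"Any \w\w\wty", text)   # ['Any ktbty']

# \s? is an optional space and \w* is zero or more word characters,
# so both 'hwA rch' and 'hwA <word> rch' are found.
he_ran = re.findall(r"hwA\s?\w* rch", text)        # ['hwA lA rch', 'hwA rch']

# \w+ requires at least one word character after the space.
after_nwrA = re.findall(r"nwrA \w+", text)         # ['nwrA yph']

# \w? allows one optional letter in front of ASr.
asr_forms = re.findall(r"\w?ASr", "ASr kASr mASr") # ['ASr', 'kASr', 'mASr']
```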
    • Learning to Represent Quantifiers
      • Now type Sw?lHN שו?לחן as a 'noun' in '--All Literature--'. You will notice that this brings up two spellings of SwlHN שולחן: with the vav or without. Note that there is often a reason why it is spelled without a vav: sometimes it is in construct and sometimes it is from a source that has had its vowels stripped out. To view the source with vowels, go to 'advanced search'.
      • Next, search for Ah+ +אח as an 'adv' in '--All Colloquial--'. What results does this give you?
      • Type Ey*r עי*ר as an 'adj' in 'Haaretz: 08'. This will give you many forms of Eyr עיר and Eyyrh עיירה, as well as a number of other words.
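      • As a quick illustration of the three quantifiers, here are the same patterns in Python's re module on made-up transliterated strings (a sketch, with no POS filtering applied):

```python
import re

# ? : the preceding character is optional (zero or one).
optional_vav = re.findall(r"Sw?lHN", "SwlHN SlHN")   # ['SwlHN', 'SlHN']

# + : the preceding character occurs one or more times.
one_or_more = re.findall(r"Ah+", "Ah Ahhh A")        # ['Ah', 'Ahhh']

# * : the preceding character occurs zero or more times,
# which is why words other than the intended ones can match.
zero_or_more = re.findall(r"Ey*r", "Eyr Eyyrh Er")   # ['Eyr', 'Eyyr', 'Er']
```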
    • Learning to Represent any Character
      • Type byt.spr בית.ספר as a 'noun' in 'Wikipedia'. Note that period means any character, including spaces and punctuation. Wikipedia is the largest subcorpus, so it will take about two minutes. Go to 'word forms', and you will find many forms of byt spr בית ספר, but also byt-spr בית-ספר.
      • Now type wbkN.? ?.ובכן as an 'adv' in 'Journals'. This will give you wbkN ובכן, but also different forms of punctuation that come after it.
      • Search for byN.+wbyN בין.+ובין as an 'adv' in 'Early Fiction'. Look through 'citations' and see what results it retrieved. Some of the results are exactly what you would expect, whereas others are slightly different.
      • Type rwdF AHry.* *.רודף אחרי as a 'string' in '--All News--'. Now look at word forms and see what results there are.
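      • The period patterns above can be checked with Python's re module, as in this sketch on made-up transliterated text. Note that .+ and .* are greedy: they stretch as far as they can, which is one reason some results look different from what you might expect.

```python
import re

# . matches exactly one character of any kind: a space or a hyphen here,
# but not "nothing", so 'bytspr' is not matched.
school = re.findall(r"byt.spr", "byt spr byt-spr bytspr")
# ['byt spr', 'byt-spr']

# .? picks up one optional trailing character, such as punctuation.
well = re.findall(r"wbkN.?", "wbkN, wbkN")
# ['wbkN,', 'wbkN']

# .+ needs at least one character between the two words.
between = re.findall(r"byN.+wbyN", "byN hAb wbyN hbN")
# ['byN hAb wbyN']

# .* runs to the end of the line.
chasing = re.findall(r"rwdF AHry.*", "hwA rwdF AHry hkdwr")
# ['rwdF AHry hkdwr']
```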
    • Learning to Use Square Brackets
      • Type mw[zs]yqh מו[זס]יקה as a 'noun' in 'Journals'. Square brackets indicate that more than one character can occupy that space. This will give you results for both mwzyqh מוזיקה and mwsyqh מוסיקה .
      • Now type mw[^zs]yqh מו[^זס]יקה as the same POS in the same corpus. The caret before the z ז and s ס tells the program to find anything but these two letters. This will find a few occurrences of mwnyqh מוניקה and mwryqh מוריקה; note that the characters in that position are not z ז or s ס.
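      • A short sketch of both bracket forms in Python's re module, using the four transliterated forms mentioned above:

```python
import re

words = ["mwzyqh", "mwsyqh", "mwnyqh", "mwryqh"]

# [zs] accepts either z or s in that position.
either = [w for w in words if re.fullmatch(r"mw[zs]yqh", w)]
# ['mwzyqh', 'mwsyqh']

# [^zs] accepts any single character except z and s.
neither = [w for w in words if re.fullmatch(r"mw[^zs]yqh", w)]
# ['mwnyqh', 'mwryqh']
```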
    • Learning to Use Parentheses
      • Parentheses, similar to square brackets, are used to show alternate possibilities. For example, type Sn(h|yM|w?t) (שנ(ה|ים|ו?ת as an 'adv' in 'Raanana'. This will give you results for the singular, plural, and the construct form for both singular and plural.
      • You can also use a question mark after the parentheses to include the word outside of it in the results as well. For example, search for ly?b(y|K|w|h|kM|kN|M|N|nw)? ?(לי?ב(י|ך|ו|ה|כם|כן|ם|ן|נו as an 'adv' in 'Journals'. This will give you examples of lb לב with the pronoun endings, but it will also give you lb לב in the results.
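      • The two parenthesized patterns can be tested against candidate forms like this. It is a sketch using Python's re module; the candidate lists are made up, and fullmatch is used so that each form must match in its entirety.

```python
import re

# Alternatives inside parentheses are separated by |.
year = r"Sn(h|yM|w?t)"
year_forms = [w for w in ["Snh", "SnyM", "Snwt", "Snt", "Sny"]
              if re.fullmatch(year, w)]
# ['Snh', 'SnyM', 'Snwt', 'Snt']

# A ? after the closing parenthesis makes the whole group optional,
# so the bare word lb is matched too.
heart = r"ly?b(y|K|w|h|kM|kN|M|N|nw)?"
heart_forms = [w for w in ["lb", "lby", "lybnw", "lbkM", "lbb"]
               if re.fullmatch(heart, w)]
# ['lb', 'lby', 'lybnw', 'lbkM']
```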
    • Learning to Represent Word Boundaries
      • Even though 'adv' lets you limit what the program accepts, it does not let you limit the results to only that word form. This is possible by using \b ב\. For example, search for Edh עדה as an 'adv' in '--All News--'. This will give you 1,663 occurrences of wEdh ועדה in addition to the 278 instances of Edh עדה that you want. You can prevent this from happening by performing the same search as \bEdh בעדה\ as an 'adv', which marks the word boundary.
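      • The effect of \b can be seen in a small sketch with Python's re module on made-up transliterated text:

```python
import re

text = "Edh wEdh Edh"

# Without a boundary, Edh is also found inside wEdh.
loose = re.findall(r"Edh", text)      # three matches
# With \b in front, a match must begin at the start of a word.
strict = re.findall(r"\bEdh", text)   # two matches

print(len(loose), len(strict))  # 3 2
```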
    • Learning to Search for the Beginning and End of Words
      • You can also use ^ outside of square brackets to mark the beginning of words. Type ^AnTy אנטי^ as a 'string' in 'Journals'. This will give you all words starting with AnTy אנטי.
      • In addition, $ outside of square brackets marks the end of word. Type lwgyh$ $לוגיה as a 'string' in 'Journals'. This will give you all of the words ending in lwgyh לוגיה.
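      • In Python's re module, ^ and $ anchor to the start and end of the string being tested rather than of each word, so the sketch below applies them to one word at a time to imitate the behavior described here. The word list is made up.

```python
import re

words = ["AnTybywTyqh", "byAnTy", "bywlwgyh", "lwgyhM"]

# ^ requires the match to begin at the start of the word.
starts = [w for w in words if re.search(r"^AnTy", w)]
# ['AnTybywTyqh']

# $ requires the match to finish at the end of the word.
ends = [w for w in words if re.search(r"lwgyh$", w)]
# ['bywlwgyh']
```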
  • Searching for Punctuation
    • In order to search for punctuation, it is necessary to put a backslash before it. As an example, type zh\. .\זה to find all examples of zh זה with a period after it. Do not type zh. .זה (without the backslash) or you will get results of zh זה with any character after it. Try this with other forms of punctuation.
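    • The difference between an escaped and an unescaped period can be checked with Python's re module, as in this sketch on made-up transliterated text:

```python
import re

text = "zh. zhw zh"

# \. matches only a literal period.
escaped = re.findall(r"zh\.", text)    # ['zh.']
# An unescaped . matches any single character after zh.
unescaped = re.findall(r"zh.", text)   # ['zh.', 'zhw']

# In Python, re.escape adds the backslash for you.
same = re.findall(re.escape("zh."), text) == escaped
```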
  • Cutting Out Results by Hand
    • Type b ב as a 'noun' in 'Modern Fiction-Tran:05-10'. Notice that you get the forms of the preposition b ב that you are looking for, but you also get byN בין, bN בן, etc., which you do not want. To cut these forms out, search for b -- (byN|bN) as the same POS in the same subcorpus. This will only cut out the exact forms byN בין and bN בן. To cut out more forms, you will have to include more in the search or use regular expressions to list more.
    • As a further example, suppose that you want to search for byt בית in phrases other than byt spr בית ספר. To do this, search for byt \w+ -- spr בית \ו+ -- ספר. This will give you all results except for byt spr בית ספר.
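    • One way to picture the ' -- ' syntax is as a two-step filter: keep everything that matches the first pattern, then drop any hit containing a word that exactly matches the second. The sketch below implements that reading. It is an assumption that happens to fit both examples above, not a description of the program's internals, and the function name and hit lists are hypothetical.

```python
import re

def search_with_exclusion(query: str, hits: list[str]) -> list[str]:
    """Keep hits matching the part before ' -- ', then drop hits in
    which any word exactly matches the part after it."""
    include, _, exclude = query.partition(" -- ")
    kept = [h for h in hits if re.search(include, h)]
    if exclude:
        kept = [h for h in kept
                if not any(re.fullmatch(exclude, w) for w in h.split())]
    return kept

single = search_with_exclusion("b -- (byN|bN)", ["b", "byN", "bN", "bbyt"])
# ['b', 'bbyt']
phrase = search_with_exclusion(r"byt \w+ -- spr",
                               ["byt spr", "byt knst", "byt hspr"])
# ['byt knst', 'byt hspr']
```

    • Note that only the exact form spr is cut out in the second search; byt hspr survives, which is why further forms must be listed explicitly.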
  • Using Advanced Search
    • Click on 'advanced search' under the 'submit' button. On that page, click on 'instructions'. This will give you instructions you can follow as you use the search.
    • Now click 'search by hand'. This is the only fully usable advanced search so far. Try searching for \wx\wx\w as an 'adv' in 'Tanach'. This will give you quite a few different segholate nouns that occur in the Hebrew Bible.
    • You can search for many other words and constructions using this search. Just remember that you need to be flexible, and include the possibility of different orders (for example, the dagesh might come before a vowel, or the vowel before the dagesh). You should also put a question mark after the vowel if you want the program to find the word unvoweled along with voweled.
Questions/Problems
  • What are some common errors to avoid?
    • Typing the noun with an article, unless you mean to limit your search (the tool automatically looks for nouns and adjectives with and without the article; if you type it with the article it won't find any examples without it).
    • Choosing the wrong POS filter (looking for an adjective using the 'noun' POS, looking for all instances of a word using the 'adv' POS, etc.).
    • Choosing 'noun' when looking for a phrase (you should usually choose 'adv' or 'string').
  • Why do I want to filter my results?
    • If you are trying to find examples of a particular word or construction, and the program finds hundreds of results of something else that happens to match what you are looking for (because of the morphological ambiguity of Hebrew), it can be very time consuming and annoying to search through the citations one by one to find the ones you really want. If you can figure out a way to filter out the bad ones, you can save yourself many hours.
  • How do I take care of filtering myself, and bypass the POS filters?
    • Choose 'string' and type a regular expression that matches what the POS filters do, or that varies them. For example:
    • \bSqr\b בשקר\ב\ will find ONLY the word Sqr שקר with no prefixes or suffixes.
    • \b[bl]?byt\b ב[בל]?בית\ב\ allows byt בית, bbyt בבית, lbyt לבית, and nothing else.
    • \b[mS]?h?AH\b ב[מש]?ה?אח\ב\ allows AH אח, hAH האח, mAH מאח, mhAH מהאח, SAH שאח, and ShAH שהאח, and nothing else.
    • \b[mS]?h?AHy?(K|w|h)?\b ב[מש]?ה?אחי?(ך|ו|ה)?\ב\ allows all of the above and the singular pronoun endings.
    • It is important to remember that the program will find only the forms you enter in this way, but if there is a punctuation mark without a space between it and another word, those marks will show up in the results as well. For example, in the search for \b[bl]?byt\b ב[בל]?בית\ב\, mSq-byt משק-בית would show up in the results. You will also find a lot of typos in the papers this way.
    • The program can deal with complex expressions efficiently, so you can get quite a bit of control over what you are searching for if you want it.
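    • The example expressions above can be verified with Python's re module. The sketch below uses made-up transliterated input and assumes the site's \b behaves like Python's. It also shows why a hyphenated form such as mSq-byt still appears in the results.

```python
import re

# \b also falls next to a hyphen, so the byt in mSq-byt is found,
# while hbyt and bytw are rejected.
text = "byt bbyt lbyt hbyt mSq-byt bytw"
house = re.findall(r"\b[bl]?byt\b", text)
# ['byt', 'bbyt', 'lbyt', 'byt']

def allowed(pattern: str, words: list[str]) -> list[str]:
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

brother = allowed(r"\b[mS]?h?AH\b",
                  ["AH", "hAH", "mAH", "mhAH", "SAH", "ShAH", "wAH", "AHy"])
# ['AH', 'hAH', 'mAH', 'mhAH', 'SAH', 'ShAH']

with_endings = allowed(r"\b[mS]?h?AHy?(K|w|h)?\b",
                       ["AH", "AHyK", "AHyw", "mhAHyh", "AHynw"])
# ['AH', 'AHyK', 'AHyw', 'mhAHyh']
```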
  • Are there copyright issues in making these texts available?
  • What browser is recommended for this site?
    • The Firefox browser is recommended, although you should be able to use this program with other browsers as well.
  • Why do characters sometimes show up incorrectly?
    • Due to the right-to-left nature of Hebrew, sometimes when numbers or special characters are next to other characters they can be displayed in an unusual way.
  • Other concerns not listed here
    • If your concern is not listed here, feel free to contact Justin Parry at justin.parry@mail.utexas.edu with any additional questions. A link is provided at the bottom of the website which will send you directly to mail.
Site maintained by the College of Humanities. For other questions or suggestions, please contact Justin Parry