hebrewCorpus

General Information about hebrewCorpus

HebrewCorpus, which contains over 150 million words, was developed by The National Middle East Language Resource Center (NMELRC). More information about NMELRC is available here. HebrewCorpus is intended for scholars, teachers, and students of the Hebrew language. It gives the user the ability to:

Search subcorpora either separately or in combination.
Search for words alone or in combination, including phrases.
Choose from POS filters. This corpus is not tagged for part of speech, but it looks at affixes and word structure to attempt to predict what part of speech a word is. This still leaves some ambiguity, but it saves a lot of time.
Use most "regular expression" language, which serves to narrow searches considerably.

Instructions on Searching hebrewCorpus

Basic Search Instructions
- Either type a search into the 'latin chars' box using transliterated latin characters or type into the 'hebrew chars' box using Hebrew characters.
- Choose a part of speech from the drop-down box to filter the results based on structure and affixes; if you do not want to filter the results, choose 'string'.
- Choose a subcorpus from the drop-down box to search.
- Click on 'submit'.
Detailed Search Instructions
- Searching for Words
  - Type search word or phrase into one of the two boxes on the left (for specialized searches, see "Advanced Search Capabilities"). Type it WITHOUT vowels.
  - EITHER: Type the transliteration into the 'latin chars' box, using the ht transliteration system (click 'transliteration help' to see the chart; note that there is a one-to-one correspondence between the Hebrew and the transliterated letters).
  - OR: Type the word in Hebrew script into the 'hebrew chars' box.
- Choosing a Subcorpus
  - Select one of the subcorpora to search, or search them together using one of the combinations, such as --All Literature--. See "Information About Subcorpora" for information about each subcorpus, as well as a downloadable list of word totals.
- Understanding Part-of-Speech Filters
  String
  
  String will not apply a filter to the results.
  
  EVERYTHING that matches your string will be returned, no matter what is before or after it.
  
  For example, if you type ktb כתב, you will get every example of that string in the subcorpus, including the following:
  
  lktbh לכתבה
  
  nktb נכתב
  
  mktb מכתב
  
  hhtktbwywt ההתכתבויות
  
  bktbwtyh בכתבותיה
  
  hhktbh ההכתבה
  
  ktbyK כתביך
  
  NOTE: Any part-of-speech (POS) filter other than 'string' will filter your results to:
  
  The search string alone, with only a space or punctuation before or after it (except for 'verb', which includes other regular conjugations)
  
  The search string with suffixes and prefixes that go with the chosen POS
  
  All other instances of the string will be filtered out (those that are not the bare form, or with the known suffixes and prefixes).
  
  Noun
  
  If you choose 'noun' the program will accept:
  
  the bare search string typed in
  
  the definite article h-ה
  
  the masculine plural ending -yM ים-
  
  the dual ending -yyM יים-
  
  the conjunction w-ו
  
  the subordinators S-ש and kS-כש
  
  the attached prepositions b-ב, l-ל, k-כ, and m-מ
  
  the singular and plural pronoun endings
  
  NOTE: Plurals ending in -wt ות- and irregular nouns ending in -yM ים- do not come up in a singular noun search and must be searched for separately.
  
  If you type in ktb כתב as a 'noun', the program will accept (for example):
  
  ktb כתב
  
  SktbyM שכתבים
  
  wktbyM וכתבים
  
  mhktb מהכתב
  
  bktb בכתב
  
  kktbM ככתבם
  
  kSktby כשכתבי
  
  wlktbyw ולכתביו
  
  But it will not accept (it will filter out):
  
  nktb נכתב
  
  wtktby ותכתבי
  
  bmktb במכתב
  
  ktbwt כתבות
  
  ktbty כתבתי
  
  If you search for a noun that ends with the feminine marker h-ה, the program knows to allow forms that have a t-ת when it is in the construct form or there are pronoun endings.
  
  If you type in yldh ילדה, it will accept (for example):
  
  yldh ילדה
  
  hyldh הילדה
  
  yldty ילדתי
  
  yldtnw ילדתנו
  
  If you choose 'noun', it does NOT mean all your results will be nouns. It only means that given the morphological ambiguity of Hebrew, they COULD be nouns.
  
  For example, if you type in ktb כתב and choose 'noun', some of the results will be unambiguously nouns:
  
  bktb בכתב
  
  ktbyM כתבים
  
  while others will simply be ambiguous:
  
  ktb כתב (could be ‘(hand)writing’ or ‘he wrote’)
  
  wktb וכתב (could be ‘and (hand)writing’ or ‘and he wrote’)
  
  Sktb שכתב (could be ‘that (hand)writing’ or ‘that he wrote’)
  
  The program WILL filter out forms that unambiguously are NOT nouns, which can be very helpful.
  
  Choosing a filter, therefore, reduces the number of false hits, but does not eliminate them entirely.
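
The 'noun' rules described above can be pictured as a pattern built around the search string. The following is only a minimal sketch under stated assumptions: the function names, the prefix order, and the list of pronoun endings are hypothetical and are not taken from the site's actual program. It also ignores search strings that end in a final-form letter.

```python
import re

# Assumed, illustrative lists; the site's real rule set is not documented here.
PLURAL_ENDINGS = ["yM", "yyM"]
PRONOUN_ENDINGS = ["y", "K", "w", "h", "nw", "kM", "kN", "M", "N",
                   "yK", "yw", "yh", "ynw", "ykM", "ykN", "yhM", "yhN"]

# Optional conjunction, subordinator, one attached preposition, definite article.
PREFIXES = r"w?(?:kS|S)?[blkm]?h?"


def noun_filter(stem):
    """Build a hypothetical 'noun' filter for a transliterated search string."""
    pronouns = "|".join(PRONOUN_ENDINGS)
    if stem.endswith("h"):
        # Feminine -h: keep the bare form, or swap h for t (construct form),
        # optionally followed by a pronoun ending.
        base = re.escape(stem[:-1])
        body = rf"{base}(?:h|t(?:{pronouns})?)"
    else:
        endings = "|".join(PLURAL_ENDINGS + PRONOUN_ENDINGS)
        body = rf"{re.escape(stem)}(?:{endings})?"
    return re.compile(PREFIXES + body)


def accepts(stem, word):
    """True if the whole word is the stem plus permitted prefixes and endings."""
    return noun_filter(stem).fullmatch(word) is not None


candidates = ["ktb", "SktbyM", "mhktb", "nktb", "ktbwt", "ktbty"]
print([w for w in candidates if accepts("ktb", w)])
```

Because the pattern only checks spelling, an ambiguous form such as ktb passes even when it is a verb in context, which is the behavior the section describes.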
  
  Adj
  
  If you choose 'adj', the program will accept:
  
  the bare search string typed in
  
  the definite article h-ה
  
  the conjunction w-ו
  
  the singular and plural masculine and feminine forms (if you search with the masculine singular form)
  
  NOT the prepositions or the pronoun endings
  
  If you type in HkM חכם, the program will accept (for example):
  
  HkM חכם
  
  Hkmh חכמה
  
  hHkmyM החכמים
  
  whHkmwt והחכמות
  
  but it will not accept (it will filter out):
  
  ShHkM שהחכם
  
  Hkmy חכמי
  
  If you want to try to find forms like ShHkM שהחכם and Hkmy חכמי, you need to search for HkM חכם as a noun, not as an adjective.
  
  Alternatively, you could search directly for ShHkM שהחכם and Hkmy חכמי as strings, thereby bypassing the POS filters.
  
  Again, if you choose 'noun' or 'adj' or any other part of speech, it does not necessarily mean that you are searching for that part of speech; it only means you want to allow that particular set of prefixes and suffixes through the search filter.
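
The 'adj' behavior can be sketched in the same spirit. This is an illustration only, not the site's code: the helper names are hypothetical, the feminine singular is assumed to end in -h, and the final-letter table is inferred from the transliterations used on this page (for example swF סוף versus spr ספר).

```python
import re

# Final-form letters in the transliteration and their assumed non-final forms.
DEFINALIZE = {"K": "k", "M": "m", "N": "n", "F": "p", "C": "c"}


def adj_forms(masc_sg):
    """Generate the four gender/number forms from the masculine singular."""
    last = masc_sg[-1]
    # A final-form letter becomes non-final once an ending is attached.
    base = masc_sg[:-1] + DEFINALIZE.get(last, last)
    return [masc_sg, base + "h", base + "yM", base + "wt"]


def adj_filter(masc_sg):
    """Allow only the conjunction w- and the article h- before any of the forms."""
    forms = "|".join(re.escape(f) for f in adj_forms(masc_sg))
    return re.compile(rf"w?h?(?:{forms})")


def accepts_adj(masc_sg, word):
    return adj_filter(masc_sg).fullmatch(word) is not None


print(adj_forms("HkM"))
```

Forms such as ShHkM or Hkmy fail this pattern because neither the subordinator nor a pronoun ending is on the allowed list.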
  
  Adv
  
  If you choose 'adv', the program will accept:
  
  the bare search string typed in
  
  the conjunction w-ו
  
  and nothing else
  
  This category is handy for adverbs, but it is also useful when you are searching for a specific form.
  
  For example, if you want to find the noun ktb כתב only when it is preceded by b-ב (and not all the other possible forms of ktb כתב), type bktb בכתב and choose 'adv'. It will accept:
  
  bktb בכתב
  
  wbktb ובכתב
  
  and nothing else. The same technique works if you want to find a specific verb form (nktb נכתב) as opposed to all the conjugations of that verb.
  
  If you choose 'adv', it does not necessarily mean you think the word is an adverb; it just means that you want the filter to cut out everything except the specific form you typed in, as well as that form prefixed by w-ו.
  
  If you want to search for a form completely by itself (without the w-ו), read about marking word boundaries under "Searching with Regular Expressions".
  
  If you choose 'adv', it is almost the opposite of choosing 'string'. 'String' accepts every occurrence of the string in the subcorpus, no matter what characters surround it, while 'adv' accepts only what you typed as an isolated word, that word preceded by w-ו, and nothing else.
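
The contrast between 'string' and 'adv' can be shown on a handful of word forms. This is a rough sketch, assuming that 'string' behaves like a substring test and 'adv' like a whole-word test with an optional w- prefix; the function names are made up for the illustration.

```python
import re

# Word forms taken from the examples on this page.
words = ["bktb", "wbktb", "bktbwtyh", "lktbh", "nktb", "mktb", "ktbyK"]


def string_hits(query, candidates):
    """'string': keep every word that contains the query anywhere."""
    return [w for w in candidates if query in w]


def adv_hits(query, candidates):
    """'adv': keep only the exact form, optionally preceded by w-."""
    pattern = re.compile(rf"w?{re.escape(query)}")
    return [w for w in candidates if pattern.fullmatch(w)]


print(string_hits("bktb", words))
print(adv_hits("bktb", words))
```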
  
  Verb
  
  If you choose 'verb', the program will accept:
  
  the search string typed in
  
  the perfect and imperfect verb regular conjugation suffixes and prefixes
  
  the regular infinitive and imperative forms
  
  the passive participle form
  
  the conjunction w-ו
  
  the subordinators S-ש and kS-כש
  
  Because of the programming, it will also accept forms with the definite article h-ה and the singular and plural pronoun endings.
  
  NOTE: The program assumes you will type in at least one verb form. If you choose to be more specific, separate different forms by a comma WITH NO SPACES. You can also see all of the different forms by typing in the masculine singular (hwA הוא) form. Usually, the more forms of the masculine you type in (past, present, future), the fewer results the program will return and the less problematic the results will be. You may only need to type in one form, but if you do not get all of the results you want, type in the different forms yourself.
  
  If you type in ktb,yktwb כתב,יכתוב, it will accept (for example):
  
  ktb כתב
  
  lktwb לכתוב
  
  ktbw כתבו
  
  Syktwb שיכתוב
  
  kSktbty כשכתבתי
  
  wktbnw וכתבנו
  
  Because the program more or less mechanically applies rules without understanding what the resulting forms mean, it will also accept some unwanted forms like:
  
  ktby כתבי (the writing of)
  
  nktb נכתב (to be written; niphal)
  
  The program is set up to understand the different binyanim in its searches, but note that it will only tell you on the 'summary' page which binyan it found if the search is performed in the 3ms past tense and it is a regular verb; otherwise it will just say 'vPaal'.
  
  The program also tries to handle hollow, defective, and other special verbs correctly. Its analyses are not perfect, however, so you should check the search strings it actually applies on the 'summary' page.
Advanced Search Capabilities
- Searching for More than One Word at Once
  - If you search for more than one word at once, it will sort the results together. This can be handy, for example, if you want all examples of both a feminine or irregular noun and its plural.
  - To search for more than one word at once, type in the words separated by a comma. Be sure NOT to add extra spaces.
  - To find 'father' and 'fathers', type: Ab,Abwt אב,אבות and choose 'noun'.
  - To find two different ways to say 'finally', type lbswF,swF swF לבסוף,סוף סוף and choose 'string'.
  - Remember that even though there is a standard orthography for plene writing, once in a while an alternate spelling is used. For example, Tlwyzyh טלויזיה is sometimes spelled as Tlwwyzyh טלוויזיה; the same is true of bEyh בעיה or bEyyh בעייה. You may not want both results, but searching for both of them will give you a fuller picture of the word.
  - If the words you type in need different POS filters (an adverb and a noun, for example), results will be unpredictable.
  - NOTE: Searching for more than one word at once when the 'verb' POS filter is chosen is effective only for different forms of the verb; searching for other things under the verb POS filter gives you unpredictable results that you probably do not want.
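
A small sketch of why extra spaces matter in a comma-separated search. It assumes the query is split on commas only and that every space is treated as a literal space in the search term; `split_terms` is a hypothetical helper, not part of the site.

```python
def split_terms(query):
    """Split on commas only; spaces are kept because they are significant."""
    return query.split(",")


print(split_terms("Ab,Abwt"))        # two single-word terms
print(split_terms("lbswF,swF swF"))  # the second term is a two-word phrase
print(split_terms("Ab, Abwt"))       # the stray space becomes part of the second term
```

Under this assumption, the third query would look for Abwt only when a space comes directly before it within the search term, which is rarely what is intended.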
- Searching for Phrases
  - Hitting the space bar will cause the program to look for a space; this means that if you type two words (or more) separated by a space, it will look for them as a phrase.
  - Most of the POS filters make no sense when searching for phrases, but they can sometimes be helpful.
  - Normally you would choose 'string' to find the phrase no matter what else is around it. Choosing other POS filters allows the additions that each filter permits.
- Searching with Regular Expressions
  - Using regular expressions can significantly increase the power of your search.
  - There are several ways to search for certain things using both the POS filters and regular expressions; learning them (as shown below) will open up a world of unique searches.
  - You can use the backslash character in conjunction with certain letters to mean specific things:
  - You can search with more than one word character in a row as a very effective way to create word skeletons. For example:
  - You can use quantifiers with word characters to mean specific things:
  - You can also use quantifiers with specific characters. For example:
  - You can search for any character (a space, letter, punctuation mark, number, etc.) by inserting a period (.) into the search. For example:
  - You can also use quantifiers with the period. For example:
  - You can use the caret ^ with specific characters to exclude them from the search. For example:
  - You can do this with a list of characters, so it will list anything but what you cut out. For example:
  - You can also use square brackets to indicate a list of things, one of which can go in that position in the word. For example:
  - You can also use quantifiers with the list:
  - You can use ^ and $ (without brackets) to indicate the beginning and end of a word, respectively. In this search, it is necessary to use the ‘string’ POS. For example:
  - You can use parentheses and the vertical bar to indicate alternation between whole forms. For example:
  - You need to think carefully about how your regular expression is going to interact with the POS filter you have chosen. For example:
  - If you want complete control, choose 'string', and only what you cut out yourself will be cut out.
  - Using all of these regular expressions in combination can be a very powerful way to design precise searches. For example:
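
To experiment with these constructs offline, the following sketch applies them to a few transliterated word forms using Python's `re` module. It assumes the site's regular expressions behave like standard ones, which may not hold in every detail. Each word is tested on its own, so `^` and `$` act as word boundaries here, as they do in the description above.

```python
import re

# Toy list of transliterated word forms taken from the examples on this page.
forms = ["ktb", "nktb", "mktb", "bktb", "ktbw", "ktbyM", "kwtb", "ktwb"]


def hits(pattern, words=forms):
    """Return the words in which the pattern is found."""
    return [w for w in words if re.search(pattern, w)]


print(hits(r"^ktb$"))          # ^ and $ pin the match to the whole word
print(hits(r"^k\w?tb$"))       # \w? allows zero or one extra word character
print(hits(r"^ktb.+$"))        # . with + requires at least one more character
print(hits(r"^[nm]ktb$"))      # exactly one character from the list
print(hits(r"^[^nm]ktb$"))     # any single character except those listed
print(hits(r"^(kwtb|ktwb)$"))  # alternation between whole forms
```
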
- Using Hebrew Script in Regular Expressions
  - As shown in the examples above, regular expressions work in Hebrew script as well as they do in English. Just remember expressions need to be entered on the opposite side of the character, since Hebrew is written from right to left.
  - If you are unsure about where the special characters are found on the Hebrew or Hebrew QWERTY keyboards, it is possible to see the layout online. If you still cannot find the characters you need on your keyboard, you can open Character Palette on Macs (available under 'International'), or Character Map on Windows and enter the characters manually.
  - NOTE: Once the results are displayed, most browsers display the backslashes and parentheses in bizarre and unpredictable ways. This has no effect on your search; it is just a problem with displaying Hebrew.
- Cutting Out Results by Hand
  - Like searching with regular expressions, cutting out results by hand can be a powerful tool in refining your searches.
  - Sometimes you know that the ambiguous morphology of a particular form is going to give you many false hits you do not want.
  - To cut out forms that otherwise would be found by a search, type your search word or expression, a space, one or two dashes, a space, and then a word with or without regular expressions that indicates what you don't want. For example, spr -- mspr ספר -- מספר cuts out mspr מספר from the results.
  - You can use a vertical bar to cut out multiple things.
  - For example, if you search for all forms of the verb dybr,mdbr דיבר,מדבר, you will get many examples of dbr דבר and other undesired forms. To avoid this, you can type dybr,mdbr -- ^w?h|^w?dbr דיבר,מדבר -- ^ו?ה|^ו?דבר. If you perform this search as a verb on ynet, you will cut out 15,832 instances of: dbr דבר, dbry דברי, hmdbr המדבר, dbrw דברו, wdbr ודבר, wdbrw ודברו, wdbry ודברי, hmdbryM המדברים, hmdbrt המדברת, whmdbr והמדבר, hmdbrwt המדברות, and whmdbrt והמדברת (all of these unwanted words account for more than half of the results found for the particularly ambiguous dybr,mdbr דיבר,מדבר searched by itself).
  - This does not remove all of the ambiguity, but it does cut out a significant amount and saves a lot of time. It is valuable for many reasons, one being that you can look at the 'words before/after' page without wondering if an undesired form is producing the results.
  - It is also possible to cut out words or letters before or after the searched word, another effective way to reduce ambiguity or choose more selectively. For example, if you want to search for HwC חוץ as in 'outside', but not get results for 'apart from', you can perform this search: \bHwC \w+ -- m\w+ (in Hebrew script: \bחוץ \w+ -- מ\w+). This cuts out all instances in which the word after HwC חוץ begins with m-מ, such as the expression HwC mzh חוץ מזה, which removes more than half of the results.
  - NOTE: Whenever you search for words next to each other, it throws off the results on the 'word forms' page. It also presents forms on the 'words before/after' page that you wouldn't expect because it is presenting words next to the whole usage.
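
The cut-out syntax can be modeled as a search part and an exclusion part separated by one or two dashes. The sketch below is an assumption about how such a query might be interpreted, not the site's implementation: `split_query` and `cut_out` are hypothetical names, and the exclusion is assumed to be tested against each word form that the search found.

```python
import re


def split_query(query):
    """Separate 'search -- exclusion' into its two parts (exclusion may be absent)."""
    parts = re.split(r"\s--?\s", query, maxsplit=1)
    search = parts[0]
    exclusion = parts[1] if len(parts) > 1 else None
    return search, exclusion


def cut_out(found_forms, exclusion):
    """Drop every found form that the exclusion pattern matches."""
    if exclusion is None:
        return list(found_forms)
    pattern = re.compile(exclusion)
    return [f for f in found_forms if not pattern.search(f)]


search, exclusion = split_query("dybr,mdbr -- ^w?h|^w?dbr")
found = ["dybr", "mdbr", "mdbryM", "dbr", "dbry", "hmdbr", "wdbr", "whmdbrt"]
print(search, exclusion)
print(cut_out(found, exclusion))
```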
- Searching for Punctuation
  - Many punctuation marks can be searched for as is, but some have corresponding symbols that are also used as regular expressions. The best and easiest way to overcome this is to put a backslash before them. This tells the program to look for the exact punctuation mark. For example, g\’wrg\’ '\ג\'ורג, Amr\. .\אמר, etc.
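
The effect of the backslash can be checked with Python's `re` module, assuming the site treats an unescaped period the way standard regular expressions do. The sample text is a toy string made up for the illustration.

```python
import re

text = "hwA Amr. hyA Amrh"  # toy transliterated text, for illustration only

# Unescaped, the period stands for any character, so Amrh is found too.
print(re.findall(r"Amr.", text))

# Escaped, the period matches only a literal period.
print(re.findall(r"Amr\.", text))

# re.escape adds the backslashes for you.
print(re.escape("Amr."))
```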
- Using the Advanced Search Page
  - There is a separate page on this site for the advanced search. It is still being developed, but so far it has:
  - This page has its own set of instructions. To access the page, click 'advanced search' under the subcorpus drop-down box. The instructions will explain what all of the options are.

Results

Basic Information about Results
- Most results come back after a few seconds' wait. Less common single words give you results in about 10 seconds. If the word is common or the subcorpus is large, it may take from 30 seconds to a minute to get the results (shorter common words searched as a string can sometimes take longer). For phrases, the amount of time varies.
- If you search for something extremely common (such as an ayin by itself or multiple common words), the program will sometimes abort that specific search, and go to a blank screen. A trick that often works for this is to hit the back button on your browser; this will often display the search results.
- As can be expected, the wait time is longer if you search all genres at once. In some cases the wait time is only slightly longer, and in other cases it is doubled or tripled.
- One good way to save your results is to copy what you want from the corpus, open an Excel document, select the rows you want to paste the information into, and choose Paste Special. Under Paste Special, choose to paste as Unicode text. This will create organized rows that you can save for later viewing or side-by-side comparison.
Summary Page
- This page automatically comes up first after a search, and you can return to it at any time by clicking on 'summary'.
- This page gives the following summary information about your search:
- NOTE: The latter bit of information is useful for comparative purposes. The subcorpora are of vastly different sizes, so the actual number per subcorpus may be misleading; the number per 100,000 words is more easily compared.
Citations Page
- Click on 'citations' in the dark blue bar to see the 10 words before and the 10 words after the "hits".
- By default, these citations are sorted by the word that appears directly before the word you searched for.
- If you would rather see the citations sorted by the word directly after the "hits", click on the sentence near the top left of the page that reads 'sort by word after'.
- The citations are shown 100 at a time.
- The 'subsection' column indicates which subsection of the subcorpus the example comes from (e.g., NEWS).
- If you want to see the exact reference for the citation, click on the 'subsection' heading and it will change to 'reference'.
- If you want to see more context than 10 words before and 10 words after, click on the number to the left of the citation.
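
The citation display is a keyword-in-context listing. The sketch below shows one way such rows could be built, sorted, and paged. It is not the site's code: the function names and the placeholder tokens are assumptions, and the default window of 10 words and page size of 100 are taken from the description above.

```python
def kwic(tokens, hit_positions, window=10):
    """Return (words before, hit, words after) for each hit position."""
    rows = []
    for i in hit_positions:
        before = tokens[max(0, i - window):i]
        after = tokens[i + 1:i + 1 + window]
        rows.append((before, tokens[i], after))
    return rows


def sort_by_word_before(rows):
    """Order rows by the word directly before the hit (the default view)."""
    return sorted(rows, key=lambda row: row[0][-1] if row[0] else "")


def sort_by_word_after(rows):
    """Order rows by the word directly after the hit."""
    return sorted(rows, key=lambda row: row[2][0] if row[2] else "")


def page_count(total_hits, per_page=100):
    """Number of pages needed when results are shown per_page at a time."""
    return -(-total_hits // per_page)  # ceiling division


tokens = ["z", "HIT", "a", "b", "HIT", "c"]  # placeholder tokens
rows = kwic(tokens, [1, 4], window=2)
print(sort_by_word_before(rows))
print(sort_by_word_after(rows))
print(page_count(4347))
```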
Subsections Page
- Click on 'subsections' to see the total occurrences for that word or expression in the various subsections of the subcorpus you searched. This can be useful to see where a word is generally used, such as kdwr כדור in SPOR. If you search using one of the combined subcorpora, 'subsections' will list the subcorpora rather than individual subsections.
- The results in this section are ordered from most frequent to least frequent in terms of the number of occurrences.
- The frequency per 100,000 for each subsection is also given; this is valuable because each subsection is of a different size, and this puts the word or expression into proper perspective for better comparison.
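
The frequency per 100,000 words is a simple normalization. The sketch below uses made-up subsection names and counts (none of them come from the corpus) to show how a ranking by raw count can differ from a ranking by relative frequency.

```python
def per_100k(occurrences, total_words):
    """Occurrences per 100,000 words of running text."""
    return occurrences * 100_000 / total_words


# Hypothetical figures: (occurrences, total words in the subsection).
subsections = {"SUB_A": (300, 1_000_000), "SUB_B": (90, 100_000)}

by_count = sorted(subsections, key=lambda s: subsections[s][0], reverse=True)
by_frequency = sorted(subsections, key=lambda s: per_100k(*subsections[s]), reverse=True)

print(by_count)      # the larger subsection leads on raw count
print(by_frequency)  # the smaller subsection leads on relative frequency
```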
Word Forms Page
- Click on 'word forms' to see the exact forms that your search found, ordered by frequency.
- Click on any word in the word form list and a separate window will open with the citations for only that word form.
- Examine this list to get important hints about normal usage.
- Examine this list to identify what kinds of false "hits" you are getting so you can work to eliminate them, either by changing the POS filter, using regular expressions, or cutting them out by hand.
- NOTE: After every search, it is strongly suggested that you examine this list before you take the rest of the results seriously.
Words Before/After Page
- Click on 'words before/after' to see a list of the common words directly before and directly after the "hit", ordered by frequency.
- Examine these lists to scope out the main usages and collocations of the word you searched for.
- Examine these lists to identify structures that you had not intended to search for that you want to cut out.
- Click on a word in the list to see the citations with only that particular word before or after.
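
The 'words before/after' lists amount to counting the immediate neighbors of each hit. A minimal sketch follows, assuming the text is already split into words; the function name and the placeholder tokens are made up for the illustration.

```python
from collections import Counter


def neighbours(tokens, target):
    """Count the words directly before and directly after each occurrence of target."""
    before, after = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        if i > 0:
            before[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            after[tokens[i + 1]] += 1
    return before, after


tokens = ["a", "T", "b", "a", "T", "c", "d", "T", "b"]  # placeholder tokens
before, after = neighbours(tokens, "T")
print(before.most_common())  # most frequent preceding words first
print(after.most_common())   # most frequent following words first
```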

About Each Subcorpus

Click the following link to download a rich-text document with a comprehensive list of word totals for the subcorpora (and their subsections) found on the site:

Download Document with Word Counts

About each subcorpus, in order of appearance (*according to Wikipedia):

Arutz 7 is a written news website for an Israeli media network, which is identified with Religious Zionism and as the voice of the Israeli settlement movement.* This resource is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
Beginning Newspapers is made up of two subsections: Yanshuf and ShaarLaMatchil. Both of these newspapers contain easy Hebrew. Although a limited amount of material from them is available here, it is a great resource to study voweling. This resource is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
Erev Erev in Eilat is the first local newspaper in Israel. It was established in 1962.
Haaretz is Israel’s oldest daily newspaper. It was founded in 1918. It is said to be Israel’s most influential newspaper and is mostly read by the intelligentsia and the political and economic elites of Israel. It is described as liberal and left wing.* The subcorpus of Haaretz that is from the years 1990-91 is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
Maariv is a daily tabloid published in Israel. It is said to give a balanced representation of the diverse views that abound in Israeli society.*
Raanana Shelanu is a local newspaper for the city of Raanana, which lies in the central part of Israel.
TheMarker is an economic news website that offers news on hi-tech business, advertising, media, real estate, labor market, law, automobiles and transportation.* This resource is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
Ynet is one of the most popular Israeli news websites. It is owned and operated by Yediot Ahronot, but most of its content is original and published only on the website. In 2008, it was Israel’s most popular internet portal.*
The Tanach is the Hebrew Bible, broken into the subsections Torah, Neviim, and Ketuvim, from which the acronym TaNaKh comes.*
The Mishnah is the first major work of Rabbinic Judaism, and the first major written redaction of the Jewish oral traditions.* This text is provided courtesy of The Structured Mishnah, which is available online here.
Early Fiction is a collection of fiction from 30 early Hebrew authors who lived during the Haskalah and Hebrew revival. These texts are provided courtesy of Project Ben-Yehuda, which is an internet site devoted to making works in the public domain freely available. To visit this site, click here.
Modern Fiction-Orig is a collection of fiction written originally in Hebrew, as opposed to translated into Hebrew. Works are not included in their entirety but as excerpts from several books. It is divided into subsections by year of publication.
Modern Fiction-Tran is a collection of fiction translated into Hebrew, as opposed to written originally in Hebrew. It includes works from 24 different languages, which are separated into subsections.
Movies is a compilation of subtitles from 59 movies that have come out in the last forty-five years or so. These movies are divided into subsections by genre.
Spoken is a group of transcribed texts, from six distinct conversations (which coincide with the subsections): a meeting at a high-tech plant discussing personnel issues, a son telling his father about a trip he made to China and Mongolia, an informal car drive back from a wedding, a family going to have dinner and having it, a boy talking with his girlfriend, and a soldier and his commander discussing personal issues. Spoken is provided courtesy of CoSIH, or The Corpus of Spoken Israeli Hebrew. To read more about this project, click here.
Tapuz Forums is a collection of largely informal discussion that comes from Tapuz, an Israeli web portal.* This resource is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
Journals is a large collection of specialized periodicals and journals on a variety of topics, largely for an academic audience. Among the topics discussed are medicine, Jewish life, law, education, and society.
Knesset is a collection of sessions from the Knesset, which is the legislature of Israel. This resource is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
Wikibooks is a collection of free-content textbooks and annotated texts that anyone can edit.* It is part of the Wikimedia Foundation, and is used here under the GNU Free Documentation License. To view the site where the original articles are found, click here.
Wikinews is a collection of free-content news that anyone can edit. It differs from Wikipedia in that it is written in the format of news stories as opposed to encyclopedia articles.* It is part of the Wikimedia Foundation, and is used here under the GNU Free Documentation License. To view the site where the original articles are found, click here.
Wikipedia is a collection of free-content encyclopedia articles that can be edited by anyone.* It is part of the Wikimedia Foundation, and is used here under the GNU Free Documentation License. To view the site where the original articles are found, click here.
Wikiquote is a collection of free-content quotations from prominent people, books, films, and proverbs that can be edited even by visitors to the site.* It is part of the Wikimedia Foundation, and is used here under the GNU Free Documentation License. To view the site where the original articles are found, click here.
Wikisource is a collection of free-content textual sources. Its library includes novels, non-fiction works, letters, speeches, constitutional and historical documents, laws, and a range of other documents.* It is part of the Wikimedia Foundation, and is used here under the GNU Free Documentation License. To view the site where the original articles are found, click here.

A number of the newspaper subcorpora have four-letter abbreviations for their subsections. These are mostly self-explanatory, but for clarification they stand for the following:

CULT: Culture (culture and the arts, etc.)
INFO: Information (health, computers, help, encyclopedia, etc.)
LAWS: Laws (crime, investigations, law, etc.)
LIFE: Lifestyle (tourism, relationships, entertainment, food, etc.)
MISC: Miscellaneous (opinions, new age, letters to the editor, etc.)
NEWS: News (headlines, international, regional, etc.)
RPTS: Reports (various reports)
SPOR: Sports (various sports articles)

Announcements

27 August 2010: The tutorial is now all available on the site, which will help users quickly learn the program and its uses.
22 July 2010: Part of the tutorial is now available online; more will be added soon. A group mailing list has also been created for those interested in staying informed about updates and helpful hints.
14 July 2010: Several more subcorpora have been added to hebrewCorpus, as well as a number of combined subcorpora.
20 June 2010: An advanced search has been added to the corpus that allows the user to search for words with vowels. See 'advanced transliteration help' for a key to the letters used for the vowels.
12 January 2010: The URL for hebrewCorpus has been changed from BYU to NMELRC.
7 December 2009: A rough version of the instructions to guide the use of hebrewCorpus has been posted. Feel free to e-mail me with any additional questions or suggestions. Also, a tutorial to complement the instructions is being constructed; check back later for updates.
3 December 2009: The display of English text was fixed; it no longer appears with brackets around it or jumbled together with Hebrew.
20 November 2009: All of the subcorpora were combined into an additional searchable subcorpus called 'All Newspapers'.
11 November 2009: The corpus is being worked on to rid it of strange characters that aren't supposed to show up in the corpus (such as â¢, ¼Ö, and unconverted English characters).
19 October 2009: The corpus has been put online through BYU.
27 August 2009: Work has been started to convert the corpus data and create a beta version.

Tutorial

Introductory Note
- The tutorial can complement your knowledge of the instructions, or it can be an effective way to quickly learn the uses of the program by getting hands-on practice. In each of these searches, it is assumed that you will use the 'latin chars' box in your searches, but the Hebrew is also provided in case you would prefer that. Good luck!
First Search: Looking for a single word
- First, click 'instructions' to the right of the subcorpus drop-down box to pull up this tutorial in a separate window, so you can continue to read it as you search.
- Click on 'transliteration help' to see a chart of the transliteration system in a separate window.
- Type mwH מוח into the 'latin chars' box.
- Choose 'noun' from the 'part of speech' drop-down box.
- Choose 'Journals' from the subcorpus drop-down box.
- Click on 'submit'.
- Wait about 15 seconds, since bigger subcorpora take longer to search.
- Examine the 'summary of search results', and note the total number of occurrences, and how frequent this is in the subcorpus per 100,000 words (which is located right below the number of occurrences). The latter statistic is important, because it helps you to compare the subcorpora of different sizes.
- Click on 'citations' in the dark blue bar.
- As you can see, each example gives you the word in context with 10 words before and 10 after.
- Scroll down and look at a few of the citations. You will notice that the citations are organized by the word before; to make this easier to see, the word before is also shown under 'sort word'. Scroll back to the top.
- Note that there are 44 pages of results. Each page shows 100 results, so the 4,347 hits fill 44 pages.
- Click on one of the numbers at the left, and a new window will open with even more context, if there is more in the surrounding paragraph.
- Click on 'sort by word after' to sort the citations by the word after instead.
- Next, click on 'subsections' in the dark blue bar.
- Study the number of occurrences and relative frequencies of this word in the various sections of Journals. MED:Neurology has by far the most occurrences, which can be expected given the subject matter. Note that even though this page is sorted by total number of occurrences, the frequency is what really shows which subsection uses the word the most. For example, even though EDU:Biology has more than three times the number of occurrences of mwH than MED:Cardiology, if you look at the relative frequency you see that in actuality MED:Cardiology uses the word almost three times more than EDU:Biology.
- Now click on 'word forms' in the dark blue bar.
- Notice the different forms in which this word is found: with the definite article, by itself, with prepositional prefixes, with pronoun endings, etc.
- Click on במוח to see the citations for just that word form in a separate tab. This is a good way to refine your search to only the results you want to see.
- Click on 'words before/after' in the dark blue bar.
- Examine the most common words that come before our search word and the most common words that come after. This is a great way to see collocates and figure out common idioms. Is this what you would have predicted?
- Click on שבץ to see 413 citations of it in a new window.
- Click on 'summary' in the dark blue bar to go back to the summary page.
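The two calculations used in this walkthrough (frequency per 100,000 words and the number of result pages) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the site's own code; the function names are made up, and the occurrence counts and subcorpus sizes in the comparison are invented so that a subsection with more raw occurrences ends up with the lower relative frequency.

```python
import math

def per_100k(occurrences: int, corpus_words: int) -> float:
    """Relative frequency: occurrences per 100,000 words of the (sub)corpus."""
    return occurrences / corpus_words * 100_000

def page_count(total_results: int, per_page: int = 100) -> int:
    """Number of result pages when each page holds `per_page` citations."""
    return math.ceil(total_results / per_page)

# 4,347 results at 100 per page, as in the walkthrough above.
pages = page_count(4347)

# Invented counts and sizes, for illustration only (not real corpus figures).
large_section = per_100k(90, 9_000_000)   # many occurrences, very large section
small_section = per_100k(28, 1_000_000)   # fewer occurrences, much smaller section

print(pages, round(large_section, 2), round(small_section, 2))
```

The smaller section has fewer occurrences but the higher relative frequency, which is why the frequency column is the one to compare across subsections.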
Understanding the POS Filters
- Type zqN זקן into the 'latin chars' box, choose the 'adj' POS and the 'ErevErev: 03-09' subcorpus, and click 'submit'.
- Go through the various options in the dark blue bar as before. Note particularly under 'word forms' that you get examples of both masculine and feminine, singular and plural. You also get the definite article and the 'vav' prefix.
- Now try searching the same word in the same subcorpus, only with the 'noun' POS chosen. This time under 'word forms' you also get attached prepositions, whereas other results are cut out.
- Next, type sbbh סבבה into the 'latin chars' box, choose the 'adv' POS and the 'Tapuz Forums' subcorpus. Go through the various options once you get the results. Note that you now only get results for the bare word form and one instance of it with a vav.
- Type ktb כתב into the 'latin chars' box, choose the 'verb' POS and the 'Raanana' subcorpus, and go through the various options looking at the results. You will notice that the program generated other strings to go with this search: ktb,kwtb,ktwb כתב,כותב,כתוב. This is the system's way of trying to get all of the expected conjugations of a verb.
- Try other words and run them through each of the POS filters. Then look at word forms, and you will begin to understand the differences between all of the filters.
Searching Subcorpora by Themselves or Together
- Type qwnsTlcyh קונסטלציה into the 'latin chars' box, choose the 'noun' POS and the 'Haaretz: 08' subcorpus, and click 'submit'.
- Only six results were found. Go through the various options; you will notice that it is hard to see significant trends because of the small sample size. 'Subsections' only lists three sections, and the relative frequency is very small.
- Now try the same word and POS, but this time choose >>ALL GENRES<<. This will search all of the subcorpora simultaneously, and since it is so big it will take a while to retrieve the results.
- About three minutes later, you will see that it found 78 occurrences in the entire corpus. Go through each of the options. Notice particularly that the 'subsections' page does not give the number of occurrences in each subsection, but instead in each subcorpus. This is also the case with the other combined subcorpora. It gives you an idea of which subcorpora this word occurs in and which it does not (in which case the subcorpus does not show up in the results).
- Now try this with a new word. Search for Twb טוב as an 'adj' in 'Spoken', and click on 'submit'. This gets 48 occurrences. Go through the options and notice the trends.
- Search for this word in >>ALL GENRES<<. This word is common enough that the program will stall after a few minutes and present you with a blank screen. Click the 'back' button on your browser to retrieve the results. Go to 'subsections' and you will see interesting distributions of Twb טוב in the subcorpora. Is this what you would have guessed?
Searching for More Than One Word at Once
- Try searching for yldh ילדה as a 'noun' in 'Journals'. Go to 'word forms'. The plural for this word, yldwt ילדות, is not there since the program does not account for this variant. To overcome this in one search, type yldh,yldwt ילדה,ילדות (with no space) in the 'latin chars' box. Now go to 'word forms', and you will see that both of these words are included in the results. Now try Eyr,EryM עיר,ערים and examine these results.
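One way to think about a comma-separated search is as a single pattern that accepts any of the listed forms. The Python sketch below illustrates that idea on a small list of transliterated tokens. It is an assumption about the behavior, not the tool's actual implementation: the function name and the token list are hypothetical, and the sketch ignores the prefixes and suffixes that the 'noun' filter would also allow.

```python
import re

def forms_to_pattern(query: str) -> re.Pattern:
    """Turn a comma-separated list of forms into one alternation pattern."""
    forms = [form for form in query.split(",") if form]
    return re.compile("|".join(re.escape(form) for form in forms))

pattern = forms_to_pattern("yldh,yldwt")

# Hypothetical transliterated tokens, for illustration only.
tokens = ["yldh", "yld", "yldwt", "yldyM"]

# fullmatch keeps only tokens that are exactly one of the listed forms.
hits = [token for token in tokens if pattern.fullmatch(token)]
print(hits)
```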
Searching for Phrases
- Search for mSA wmtN משא ומתן as a 'string' in 'ErevErev:03-09'. Go to 'words before/after'. These words come before and after the entire phrase that you searched. Try other phrases, as strings and as other parts of speech, then check word forms to see what it produces. An additional example is if you wanted to find rch רצה as in "she runs" rather than "he wanted". To find this you can reduce ambiguity by searching for hyA rch היא רצה.
Searching with Regular Expressions
- Learning to Represent Word Characters
- Learning to Represent Quantifiers
- Learning to Represent any Character
- Learning to Use Square Brackets
- Learning to Use Parentheses
- Learning to Represent Word Boundaries
- Learning to Search for the Beginning and End of Words
Searching for Punctuation
- To search for punctuation, you must put a backslash before it. As an example, type zh\. .\זה to find all examples of zh זה with a period after it. Do not type zh. .זה (without the backslash) or you will get results of zh זה with any character after it. Try this with other forms of punctuation.
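The difference between the escaped and unescaped period is standard regular-expression behavior and can be checked with Python's re module. The sample text below is invented for illustration, and the search tool may differ from Python in other details.

```python
import re

# Invented transliterated sample text, for illustration only.
text = "zh. zhw zh Twb zh, zhb"

# Escaped period: matches only zh followed by a literal period.
escaped = re.findall(r"zh\.", text)

# Unescaped period: matches zh followed by ANY character.
unescaped = re.findall(r"zh.", text)

print(escaped)
print(unescaped)
```

The escaped pattern returns a single match, while the unescaped one also picks up zh followed by a letter, a space, or a comma.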
Cutting Out Results by Hand
- Type b ב as a 'noun' in 'Modern Fiction-Tran:05-10'. Notice that you get the forms of the preposition b ב that you are looking for, but you also get byN בין, bN בן, etc., which you do not want. To cut these forms out, search for b -- (byN|bN) as the same POS in the same subcorpus. This will only cut out the exact forms byN בין and bN בן. To cut out more forms, you will have to include more in the search or use regular expressions to list more.
- As a further example, suppose that you want to search for byt בית in phrases other than byt spr בית ספר. To do this, search for byt \w+ -- spr בית \ו+ -- ספר. This will give you all results except for byt spr בית ספר.
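The effect of the part after the double hyphen can be pictured as a second pass that drops any found form matching the exclusion exactly. The Python sketch below assumes that interpretation. The function name and the list of found forms are hypothetical, and the tool's real handling of the exclusion, especially for multi-word searches, may differ.

```python
import re

def cut_forms(found_forms, exclusion):
    """Drop every found form that matches the exclusion pattern exactly."""
    drop = re.compile(exclusion)
    return [form for form in found_forms if not drop.fullmatch(form)]

# Hypothetical word forms returned by a search for b, for illustration only.
found = ["b", "byN", "bN", "bnw"]

kept = cut_forms(found, "(byN|bN)")
print(kept)
```

Because only exact matches are dropped, a form such as bnw survives; cutting it out as well would require adding it to the exclusion or writing a broader regular expression.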
Using Advanced Search
- Click on 'advanced search' under the 'submit' button. On that page, click on 'instructions'. This will give you instructions you can follow as you use the search.
- Now click 'search by hand'. This is the only fully usable advanced search so far. Try searching for \wx\wx\w as an 'adv' in 'Tanach'. This will give you quite a few different segholate nouns that occur in the Hebrew Bible.
- You can search for many other words and constructions using this search. Just remember that you need to be flexible, and include the possibility of different orders (for example, the dagesh might come before a vowel, or the vowel before the dagesh). You should also put a question mark after the vowel if you want the program to find the word unvoweled along with voweled.

Questions/Problems

What are some common errors to avoid?

- Typing the noun with an article, unless you mean to limit your search (the tool automatically looks for nouns and adjectives with and without the article; if you type it with the article it won't find any examples without it).
- Choosing the wrong POS filter (looking for an adjective using the 'noun' POS, looking for all instances of a word using the 'adv' POS, etc.).
- Choosing 'noun' when looking for a phrase (you should usually choose 'adv' or 'string').

Why do I want to filter my results?

If you are trying to find examples of a particular word or construction, and the program finds hundreds of results of something else that happens to match what you are looking for (because of the morphological ambiguity of Hebrew), it can be very time consuming and annoying to search through the citations one by one to find the ones you really want. If you can figure out a way to filter out the bad ones, you can save yourself many hours.

How do I take care of filtering myself, and bypass the POS filters?

Choose 'string' and type a regular expression that matches what the POS filters do, or that varies them. For example:
\bSqr\b בשקר\ב\ will find ONLY the word Sqr שקר with no prefixes or suffixes.
\b[bl]?byt\b ב[בל]?בית\ב\ allows byt בית, bbyt בבית, lbyt לבית, and nothing else.
\b[mS]?h?AH\b ב[מש]?ה?אח\ב\ allows AH אח, hAH האח, mAH מאח, mhAH מהאח, SAH שאח, and ShAH שהאח, and nothing else.
\b[mS]?h?AHy?(K|w|h)?\b ב[מש]?ה?אחי?(ך|ו|ה)?\ב\ allows all of the above and the singular pronoun endings.
It is important to remember that the program will find only the forms you enter in this way, but if there is a punctuation mark without a space between it and another word, those marks will show up in the results as well. For example, in the search for \b[bl]?byt\b ב[בל]?בית\ב\, mSq-byt משק-בית would show up in the results. You will also find a lot of typos in the papers this way.
The program can deal with complex expressions efficiently, so you can get quite a bit of control over what you are searching for if you want it.
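The Latin-character versions of the expressions above can be tried out in Python, whose re module uses the same syntax for word boundaries, optional characters, square brackets, and parentheses. This is a sketch for experimenting with the patterns, not a copy of the search tool; the token lists are invented, and the tool may treat some characters differently than Python does.

```python
import re

# \b[bl]?byt\b : byt with an optional b or l prefix and nothing else.
byt_pattern = re.compile(r"\b[bl]?byt\b")
tokens = ["byt", "bbyt", "lbyt", "hbyt", "bytw", "mSq-byt"]  # invented examples
byt_hits = [t for t in tokens if byt_pattern.search(t)]

# \b[mS]?h?AH\b : AH with an optional m or S prefix and an optional article h.
ah_pattern = re.compile(r"\b[mS]?h?AH\b")
ah_forms = ["AH", "hAH", "mAH", "mhAH", "SAH", "ShAH", "bAH", "AHy"]
ah_hits = [t for t in ah_forms if ah_pattern.search(t)]

# Adding y?(K|w|h)? also admits the singular pronoun endings.
ah_suffix_pattern = re.compile(r"\b[mS]?h?AHy?(K|w|h)?\b")
suffix_hits = [t for t in ["AHyK", "mhAHyw", "AHyM"] if ah_suffix_pattern.search(t)]

print(byt_hits)
print(ah_hits)
print(suffix_hits)
```

Note that mSq-byt is matched by the first pattern because the hyphen counts as a word boundary, which mirrors the behavior described above. The patterns are case-sensitive, which matters because the transliteration distinguishes upper- and lower-case letters.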

Are there copyright issues in making these texts available?
- Some of these texts are in the public domain, and in that case the whole citation is available for viewing. Others, however, are copyrighted and can only be viewed partially, as a paragraph in context. This keeps them within fair use, since they are free to the public, used for academic purposes, and presented only in small amounts. For more on fair use, click here.
What browser is recommended for this site?
- The Firefox browser is recommended, although you should be able to use this program with other browsers as well.
Why do characters sometimes show up incorrectly?
- Due to the right-to-left nature of Hebrew, sometimes when numbers or special characters are next to other characters they can be displayed in an unusual way.
Other concerns not listed here

If your concern is not listed here, feel free to contact Justin Parry at justin.parry@mail.utexas.edu with any additional questions. A link at the bottom of the website will open an email to him directly.
