Instructions
click on instructions link in search bar to access these instructions at any time
General Information about hebrewCorpus
- HebrewCorpus, which contains over 150 million words, was developed by The National Middle East Language Resource Center (NMELRC). More information about NMELRC is available here. HebrewCorpus is intended for scholars, teachers, and students of the Hebrew language. It gives the user the ability to:
- Search subcorpora either separately or in combination.
- Search for words alone or in combination, including phrases.
- Choose from POS filters. This corpus is not tagged for part of speech, but it looks at affixes and word structure to attempt to predict what part of speech a word is. This still leaves some ambiguity, but it saves a lot of time.
- Use most "regular expression" language, which serves to narrow searches considerably.
Instructions on Searching hebrewCorpus
- Basic Search Instructions
- Either type a search into the 'latin chars' box using transliterated latin characters or type into the 'hebrew chars' box using Hebrew characters.
- Choose a part of speech from the drop-down box to filter the results based on structure and affixes; if you do not want to filter the results, choose 'string'.
- Choose a subcorpus from the drop-down box to search.
- Click on 'submit'.
- Detailed Search Instructions
- Searching for Words
- Type search word or phrase into one of the two boxes on the left (for specialized searches, see "Advanced Search Capabilities"). Type it WITHOUT vowels.
- EITHER: Type transliteration into the 'latin chars' box, using the ht transliteration system (click 'transliteration help' to see chart; note that there is a one-to-one correspondence between Hebrew and the transliterated letters).
- NOTE:
- Alef is a capital 'A', vav is a 'w', tet is a capital 'T', shin is a capital 'S', ayin is a capital 'E', and chet is a capital 'H'.
- The 'sofit' letters that change form at the end of the word (F, C, N, K, M, or ם, ך, ן, ץ ,ף ) are capitalized when at the end of a word.
- The program uses the letters b and p to represent all instances of the letters bet and pe (although pe sofit is a capital F). However, if you type v or f instead, the program will convert them for you (e.g., if you type Av or sfr, it will automatically convert them to Ab and spr).
- OR: Type Hebrew script into the 'hebrew chars' box (your computer must be set up to type Hebrew script).
- Since users may prefer one method over the other, throughout these instructions and the tutorial, both transliteration and Hebrew characters will be used in explanations.
- Type the base form you want to search for, unless you specifically want to limit your results. Avoid typing prefixes (such as the definite article) and suffixes (such as the plural). For example:
- Type byt בית rather than hbyt הבית, bn בן rather than bnyM בנים, and yldh ילדה rather than yldwt ילדות.
- Choose a part-of-speech filter. These filters do not always correspond to the parts of speech they are named after; see "Understanding Part-of-Speech Filters" for more details. If you do not want to filter your results, choose 'string'.
- Choose a subcorpus to search.
- Click on 'submit'.
- NOTE: If you have not chosen a subcorpus or a part-of-speech filter, a message will alert you to do so and no search will be performed.
- Remember a space literally means a space to the program; i.e., if you type two or more words it will look for the two as a phrase, rather than as individual units.
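The one-to-one transliteration described above can be sketched as a simple lookup table. This is only an illustration: the table is assembled from the letter pairs that appear in the examples throughout these instructions, and the names HEBREW_TO_LATIN, to_latin, and normalize_latin are hypothetical, not part of the site.

```python
# Hebrew-to-Latin table assembled from the examples in these instructions.
# Final ("sofit") letters map to capital letters.
HEBREW_TO_LATIN = {
    "א": "A", "ב": "b", "ג": "g", "ד": "d", "ה": "h", "ו": "w",
    "ז": "z", "ח": "H", "ט": "T", "י": "y", "כ": "k", "ך": "K",
    "ל": "l", "מ": "m", "ם": "M", "נ": "n", "ן": "N", "ס": "s",
    "ע": "E", "פ": "p", "ף": "F", "צ": "c", "ץ": "C", "ק": "q",
    "ר": "r", "ש": "S", "ת": "t",
}


def to_latin(hebrew):
    """Transliterate unvoweled Hebrew letter by letter; other characters pass through."""
    return "".join(HEBREW_TO_LATIN.get(ch, ch) for ch in hebrew)


def normalize_latin(typed):
    """Mimic the documented conversion of a typed v or f into b or p (assumed behavior)."""
    return typed.replace("v", "b").replace("f", "p")


print(to_latin("בנים"))        # final mem comes out as a capital M
print(normalize_latin("sfr"))  # f is converted to p
```

Because the mapping is one-to-one, the same table read in reverse would turn a transliterated search back into Hebrew script.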
- Choosing a Subcorpus
- Select one of the subcorpora to search, or search them together using one of the combinations, such as --All Literature--. See "Information About Subcorpora" for information about each subcorpus, as well as a downloadable list of word totals.
- Understanding Part-of-Speech Filters
- String
- String will not apply a filter to the results.
- EVERYTHING that matches your string will be returned, no matter what is before or after it.
- For example, if you type ktb כתב, you will get every example of that string in the subcorpus, including the following:
- lktbh לכתבה
- nktb נכתב
- mktb מכתב
- hhtktbwywt ההתכתבויות
- bktbwtyh בכתבותיה
- hhktbh ההכתבה
- ktbyK כתביך
- NOTE: Any part-of-speech (POS) filter other than 'string' will filter your results to:
- The search string alone, with only a space or punctuation before or after it (except for 'verb', which includes other regular conjugations)
- The search string with suffixes and prefixes that go with the chosen POS
- All other instances of the string will be filtered out (those that are not the bare form, or with the known suffixes and prefixes).
- Noun
- If you choose 'noun' the program will accept:
- the bare search string typed in
- the definite article h-ה
- the masculine plural ending -yM ים-
- the dual ending -yyM יים-
- the conjunction w-ו
- the subordinators S-ש and kS-כש
- the attached prepositions b-ב, l-ל, k-כ, and m-מ
- the singular and plural pronoun endings
- NOTE: Plurals ending in -wt ות- and irregular nouns ending in -yM ים- do not come up in a singular noun search and must be searched for separately.
- If you type in ktb כתב as a 'noun', the program will accept (for example):
- ktb כתב
- SktbyM שכתבים
- wktbyM וכתבים
- mhktb מהכתב
- bktb בכתב
- kktbM ככתבם
- kSktby כשכתבי
- wlktbyw ולכתביו
- But it will not accept (it will filter out):
- nktb נכתב
- wtktby ותכתבי
- bmktb במכתב
- ktbwt כתבות
- ktbty כתבתי
- If you search for a noun that ends with the feminine marker h-ה, the program knows to allow forms that have a t-ת when it is in the construct form or there are pronoun endings.
- If you type in yldh ילדה, it will accept (for example):
- yldh ילדה
- hyldh הילדה
- yldty ילדתי
- yldtnw ילדתנו
- If you choose 'noun', it does NOT mean all your results will be nouns. It only means that given the morphological ambiguity of Hebrew, they COULD be nouns.
- For example, if you type in ktb כתב and choose 'noun', some of the results will be unambiguously nouns:
- bktb בכתב
- ktbyM כתבים
- while others will simply be ambiguous:
- ktb כתב (could be ‘(hand)writing’ or ‘he wrote’)
- wktb וכתב (could be ‘and (hand)writing’ or ‘and he wrote’)
- Sktb שכתב (could be ‘that (hand)writing’ or ‘that he wrote’)
- The program WILL filter out forms that unambiguously are NOT nouns, which can be very helpful.
- Choosing a filter, therefore, reduces the number of false hits, but does not eliminate them entirely.
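One way to picture the 'noun' filter is as a regular expression that wraps the search string in the permitted prefixes and suffixes. The sketch below is an assumption about how such a filter could work, not the site's actual code: the names PREFIX, PRONOUN_ENDINGS, SUFFIX, and passes_noun_filter are hypothetical, the pronoun endings are an illustrative subset, and the feminine h/t alternation is not modeled.

```python
import re

# Prefixes listed for 'noun': conjunction w-, subordinators S- and kS-,
# one attached preposition (b-, l-, k-, m-), and the definite article h-.
PREFIX = r"w?(?:kS|S)?[blkm]?h?"

# Illustrative subset of singular and plural pronoun endings (an assumption).
PRONOUN_ENDINGS = ["y", "K", "w", "h", "nw", "kM", "kN", "M", "N",
                   "yK", "yw", "yh", "ynw", "ykM", "ykN", "yhM", "yhN"]

# Dual -yyM, masculine plural -yM, or one pronoun ending; all optional.
SUFFIX = "(?:yyM|yM|" + "|".join(PRONOUN_ENDINGS) + ")?"


def passes_noun_filter(base, form):
    """True if form is base plus only the affixes the sketch allows."""
    pattern = PREFIX + re.escape(base) + SUFFIX
    return re.fullmatch(pattern, form) is not None


accepted = ["ktb", "SktbyM", "wktbyM", "mhktb", "bktb", "kktbM", "kSktby", "wlktbyw"]
rejected = ["nktb", "wtktby", "bmktb", "ktbwt", "ktbty"]
print(all(passes_noun_filter("ktb", f) for f in accepted))
print(any(passes_noun_filter("ktb", f) for f in rejected))
```

Note that the sketch, like the real filter, still lets through ambiguous forms such as Sktb, since nothing in the affixes distinguishes the noun from the verb.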
- Adj
- If you choose 'adj', the program will accept:
- the bare search string typed in
- the definite article h-ה
- the conjunction w-ו
- the singular and plural masculine and feminine forms (if you search with the masculine singular form)
- NOT the prepositions or the pronoun endings
- If you type in HkM חכם, the program will accept (for example):
- HkM חכם
- Hkmh חכמה
- hHkmyM החכמים
- whHkmwt והחכמות
- but it will not accept (it will filter out):
- ShHkM שהחכם
- Hkmy חכמי
- If you want to try to find forms like ShHkM שהחכם and Hkmy חכמי, you need to search for HkM חכם as a noun, not as an adjective.
- Alternatively, you could search directly for ShHkM שהחכם and Hkmy חכמי as strings, thereby bypassing the POS filters.
- Again, if you choose 'noun' or 'adj' or any other part of speech, it does not necessarily mean that you are searching for that part of speech; it only means you want to allow that particular set of prefixes and suffixes through the search filter.
- Adv
- If you choose 'adv', the program will accept:
- the bare search string typed in
- the conjunction w-ו
- and nothing else
- This category is handy for adverbs, but it is also useful when you are searching for a specific form.
- For example, if you want to find the noun ktb כתב only when it is preceded by b-ב (and not all the other possible forms of ktb כתב), type bktb בכתב and choose 'adv'. It will accept:
- bktb בכתב
- wbktb ובכתב
- and nothing else. The same technique works if you want to find a specific verb form (nktb נכתב) as opposed to all the conjugations of that verb.
- If you choose 'adv', it does not necessarily mean you think the word is an adverb; it just means that you want the filter to cut out everything except the specific form you typed in, as well as that form prefixed by w-ו.
- If you want to search for a form completely by itself (without the w-ו), read about marking word boundaries under "Searching with Regular Expressions".
- If you choose 'adv', it is almost the opposite of choosing 'string'. 'String' accepts every occurrence of the string in the subcorpus, no matter what characters surround it, while 'adv' accepts only what you typed as an isolated word, that word preceded by w-ו, and nothing else.
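The contrast between 'string' and 'adv' can be sketched in a few lines. This is a simplified model under the assumption that the corpus is a list of separate word tokens; string_hits and adv_hits are hypothetical names, and the site's real matching may differ.

```python
import re


def string_hits(query, tokens):
    """'string': keep every token that contains the query anywhere."""
    return [t for t in tokens if re.search(query, t)]


def adv_hits(query, tokens):
    """'adv': keep only the query as a whole word, optionally prefixed by w-."""
    pattern = re.compile("w?(?:" + query + ")")
    return [t for t in tokens if pattern.fullmatch(t)]


tokens = ["bktb", "wbktb", "bktbwtyh", "hbktb", "ktb"]
print(string_hits("bktb", tokens))
print(adv_hits("bktb", tokens))
```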
- Verb
- If you choose 'verb', the program will accept:
- the search string typed in
- the perfect and imperfect verb regular conjugation suffixes and prefixes
- the regular infinitive and imperative forms
- the passive participle form
- the conjunction w-ו
- the subordinators S-ש and kS-כש
- Because of the programming, it will also accept forms with the definite article h-ה and the singular and plural pronoun endings.
- NOTE: The program assumes you will type in at least one verb form. If you choose to be more specific, separate different forms by a comma WITH NO SPACES. You can also see all of the different forms by typing in the masculine singular (hwA הוא) form. Usually, the more forms of the masculine you type in (past, present, future), the fewer results the program will return and the less problematic the results will be. You may only need to type in one form, but if you do not get all of the results you want, type in the different forms yourself.
- If you type in ktb,yktwb כתב,יכתוב, it will accept (for example):
- ktb כתב
- lktwb לכתוב
- ktbw כתבו
- Syktwb שיכתוב
- kSktbty כשכתבתי
- wktbnw וכתבנו
- Because the program more or less mechanically applies rules without understanding what the resulting forms mean, it will also accept some unwanted forms like:
- ktby כתבי (the writing of)
- nktb נכתב (to be written; niphal)
- The program is set up to understand the different binyanim in its searches, but note that it will only tell you on the 'summary' page which binyan it found if the search is performed in the 3ms past tense and it is a regular verb; otherwise it will just say 'vPaal'.
- The program also tries to handle hollow, defective, and other special verbs correctly. Its analyses are not perfect, however, so you should check the search strings it actually applies on the 'summary' page.
- Advanced Search Capabilities
- Searching for More than One Word at Once
- If you search for more than one word at once, it will sort the results together. This can be handy, for example, if you want all examples of both a feminine or irregular noun and its plural.
- To search for more than one word at once, type in the words separated by a comma. Be sure NOT to add extra spaces.
- To find 'father' and 'fathers', type: Ab,Abwt אב,אבות and choose 'noun'.
- To find two different ways to say 'finally', type lbswF,swF swF לבסוף,סוף סוף and choose 'string'.
- Remember that even though there is a standard orthography for plene writing, once in a while an alternate spelling is used. For example, Tlwyzyh טלויזיה is sometimes spelled as Tlwwyzyh טלוויזיה; the same is true of bEyh בעיה or bEyyh בעייה. You may not want both results, but searching for both of them will give you a fuller picture of the word.
- If the words you type in need different POS filters (an adverb and a noun, for example), results will be unpredictable.
- NOTE: Searching for more than one word at once when the 'verb' POS filter is chosen is effective only for different forms of the verb; searching for other things under the verb POS filter gives you unpredictable results that you probably do not want.
- Searching for Phrases
- Hitting the space bar will cause the program to look for a space; this means that if you type two words (or more) separated by a space, it will look for them as a phrase.
- Most of the POS filters make no sense when searching for phrases, but they can sometimes be helpful.
- Normally you would choose 'string' to find the phrase no matter what else is around it. Choosing other POS filters allows the additions that each filter permits.
- Searching with Regular Expressions
- Using regular expressions can significantly increase the power of your search.
- There are several ways to search for certain things using both the POS filters and regular expressions; learning them (as shown below) will open up a world of unique searches.
- You can use the backslash character in conjunction with certain letters to mean specific things:
- \w ו\ means any word character. For example, searching for tlm\wd תלמ\וד finds tlmwd תלמוד, tlmyd תלמיד, etc.
- \s ס\ means any space character. This can also be achieved by inserting a space; typing \s ס\, however, can be a good way to eliminate doubt about whether you inserted a space. For example, searching for SM\slb שם\סלב or SM lb שם לב will both find instances of this idiom.
- \b ב\ means a word boundary. For example, searching for nSyM\b נשים\ב will cut out the forms nSymh נשימה and nSymyM נשימים.
- You can search with more than one word character in a row as a very effective way to create word skeletons. For example:
- Searching for t\w\ww\wnh ת\ו\וו\ונה will bring up several results of the rare 3fp form in the paal, with any root inside of it.
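The backslash expressions above behave the same way in most regular-expression engines, so they can be tried out offline. The sketch below uses Python's re module on small made-up word lists in transliteration; the helper name matching and the test items are assumptions for illustration, and the \b line borrows the string spr rather than the nSyM example.

```python
import re


def matching(pattern, items):
    """Return the items in which the pattern is found."""
    return [item for item in items if re.search(pattern, item)]


# \w: any one word character
w_hits = matching(r"tlm\wd", ["tlmwd", "tlmyd", "tlmd"])

# \s: any whitespace character, equivalent here to typing a space
s_hits = matching(r"SM\slb", ["hwA SM lb", "SMlb"])

# \b: a word boundary, so spr must start a word
b_hits = matching(r"\bspr", ["spr", "mspr", "h spr"])

# several \w in a row form a word skeleton
skeleton_hits = matching(r"t\w\ww\wnh", ["tktwbnh", "tSmwrnh", "tktbw"])

print(w_hits, s_hits, b_hits, skeleton_hits)
```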
- You can use quantifiers with word characters to mean specific things:
- \w? ?ו\ means 0 or 1 word characters. For example, AwTw\w? ?אוטו\ו will give you AwTw אוטו, AwTwM אוטום, AwTwq אוטוק, etc.
- \w+ +ו\ means 1 or more word characters in a row. For example, AwTw\w+ +אוטו\ו will give you all words beginning with AwTw אוטו, but not AwTw אוטו itself.
- \w* *ו\ means 0 or more word characters in a row. For example, AwTw\w* *אוטו\ו will give you AwTw אוטו and all words beginning with AwTw אוטו.
- You can also use quantifiers with specific characters. For example:
- Searching for Sy?rwt שי?רות will find both Syrwt שירות and Srwt שרות.
- Searching for bEy+h בעי+ה will find both bEyh בעיה and bEyyh בעייה.
- Searching for Ely* *עלי will find El על, Ely עלי and Elyy עליי.
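The three quantifiers can be compared side by side with Python's re module. This is a sketch that treats each item as one whole word (using fullmatch); whole_word_hits and the extra test words are assumptions added for illustration.

```python
import re


def whole_word_hits(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]


auto = ["AwTw", "AwTwM", "AwTwq", "AwTwbws"]
zero_or_one = whole_word_hits(r"AwTw\w?", auto)   # ? allows at most one extra character
one_or_more = whole_word_hits(r"AwTw\w+", auto)   # + requires at least one
zero_or_more = whole_word_hits(r"AwTw\w*", auto)  # * allows any number, including none

# The same quantifiers applied to a specific letter
optional_y = whole_word_hits(r"Sy?rwt", ["Syrwt", "Srwt", "Syyrwt"])
repeated_y = whole_word_hits(r"bEy+h", ["bEyh", "bEyyh", "bEh"])
any_y = whole_word_hits(r"Ely*", ["El", "Ely", "Elyy", "Elyw"])

print(zero_or_one, one_or_more, zero_or_more)
print(optional_y, repeated_y, any_y)
```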
- You can search for any character (a space, letter, punctuation mark, number, etc.) by inserting a period (.) into the search. For example:
- Amrw." ".אמרו will find Amrw אמרו with a letter, space, punctuation mark, or anything else after it, followed by a quotation mark.
- NOTE: If you want to search for an actual period, and not any character, the period would need to be preceded by a backslash (as in \.).
- You can also use quantifiers with the period. For example:
- Searching for ch.?l צה.?ל as an 'adv' will find chl צהל, ch"l צה"ל, and even the unusual ch'l צה'ל.
- Searching for qSh mAwd.+mAwd qSh קשה מאוד.+מאוד קשה as an 'adj' will find all texts that include both of these expressions within them.
- Searching for hyy.*Tq היי.*טק will find hyy-Tq היי-טק, hyy Tq היי טק, and hyyTq הייטק, among other things.
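The behavior of the period, with and without quantifiers, can be checked with Python's re module. The strings below are made-up test items in transliteration, and the variable names are assumptions for this sketch only.

```python
import re

# A bare period stands for exactly one character of any kind.
any_then_quote = [s for s in ['Amrw:"', 'Amrw "', 'Amrw"']
                  if re.search(r'Amrw."', s)]

# An escaped period matches only a real period.
literal_period = [s for s in ["Amr.", "Amrw"] if re.search(r"Amr\.", s)]

# .? allows zero or one character between the letters.
acronym = [s for s in ["chl", 'ch"l', "ch'l", "chhhl"]
           if re.fullmatch(r"ch.?l", s)]

# .* allows any run of characters, including none.
hitech = [s for s in ["hyy-Tq", "hyy Tq", "hyyTq", "hyy"]
          if re.search(r"hyy.*Tq", s)]

print(any_then_quote, literal_period, acronym, hitech)
```

Because .* and .+ match as much as they can, a pattern such as hyy.*Tq can stretch across long spans of text, which is why these searches sometimes return more than expected.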
- You can use the caret ^ with specific characters to exclude them from the search. For example:
- Searching for tE[^l]h תע[^ל]ה will exclude tElh תעלה from the results, but it will include tESh תעשה, tEnh תענה, etc.
- You can do this with a list of characters, so it will list anything but what you cut out. For example:
- Searching for \b[^AEhH]\wwy\b \ב[^אעהח]\ווי\ב will find any masc. sing. final-hey passive participle that does not begin with a guttural, such as bnwy בנוי and qnwy קנוי.
- You can also use square brackets to indicate a list of things, one of which can go in that position in the word. For example:
- [ywh] [יוה] means the letter has to be either yod, vav, or hey, so you could search for bt[ywh] [בת[יוה to find bty בתי, btw בתו, or bth בתה.
- You can also use quantifiers with the list:
- [ywh]+ +[יוה] means any combinations of yod, vav, and/or hey in a row. For example you could search H[ywh]+ +[ח[יוה to find some interesting distributions.
- [yw]? means either yod, vav, or nothing. For example you could search for byt[yw]? ?[בית[יו to find byt בית, byty ביתי, and bytw ביתו.
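The bracket examples above can be reproduced with Python's re module. The sketch treats each test item as a whole word; the helper whole_word_hits and the extra non-matching words are assumptions added for illustration.

```python
import re


def whole_word_hits(pattern, words):
    """Return the words that the pattern matches from start to end."""
    return [w for w in words if re.fullmatch(pattern, w)]


# [^l]: any single character except l
not_l = whole_word_hits(r"tE[^l]h", ["tElh", "tESh", "tEnh"])

# [^AEhH]: the first letter may not be one of the listed gutturals
participles = whole_word_hits(r"\b[^AEhH]\wwy\b", ["bnwy", "qnwy", "ESwy", "Hbwy"])

# [ywh]: exactly one of the listed letters
one_of = whole_word_hits(r"bt[ywh]", ["bty", "btw", "bth", "btM"])

# [yw]?: one of the listed letters, or nothing
optional_one_of = whole_word_hits(r"byt[yw]?", ["byt", "byty", "bytw", "byth"])

print(not_l, participles, one_of, optional_one_of)
```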
- You can use ^ and $ (without brackets) to indicate the beginning and end of a word, respectively. In this search, it is necessary to use the ‘string’ POS. For example:
- Searching for ^An אנ^ finds every word that begins with An אנ, such as AnHnw אנחנו and AnSyM אנשים.
- Searching for wty$ $ותי finds every word that ends with wty ותי, such as tyyrwty תיירותי and zkwty זכותי.
- You can use parentheses and the vertical bar to indicate alternation between whole forms. For example:
- Searching for m(mny|mK|mnw|mnh|Ay?tnw|kM|kN|hM|hN) (מ(מני|מך|מנו|מנה|אי?תנו|כם|כן|הם|הן, although a long string, will give you results with all of the pronoun suffixes for the preposition mN מן.
- Put a question mark after the parentheses to include the portion outside of it in the results as well; Hbr(h|wt|yM)? ?(חבר(ה|ות|ים will give you Hbr חבר, Hbrh חברה, HbryM חברים, and Hbrwt חברות in the results (having no question mark would exclude Hbr חבר).
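Anchors and alternation can be tried in the same way. In the sketch below each test item is a single word, so ^ and $ mark the beginning and end of that word; the variable names and the non-matching test words are assumptions for illustration.

```python
import re

# ^ anchors the match to the beginning of the word, $ to the end.
starts_with_an = [w for w in ["AnHnw", "AnSyM", "hAnSyM"] if re.search(r"^An", w)]
ends_with_wty = [w for w in ["tyyrwty", "zkwty", "zkwtyw"] if re.search(r"wty$", w)]

# Parentheses with | list whole alternatives; a trailing ? makes the group optional.
words = ["Hbr", "Hbrh", "HbryM", "Hbrwt"]
with_bare_form = [w for w in words if re.fullmatch(r"Hbr(h|wt|yM)?", w)]
without_bare_form = [w for w in words if re.fullmatch(r"Hbr(h|wt|yM)", w)]

# The long pronoun-suffix pattern for the preposition mN
mn_pattern = r"m(mny|mK|mnw|mnh|Ay?tnw|kM|kN|hM|hN)"
mn_forms = [w for w in ["mmny", "mAytnw", "mAtnw", "mkM", "mN"]
            if re.fullmatch(mn_pattern, w)]

print(starts_with_an, ends_with_wty)
print(with_bare_form, without_bare_form, mn_forms)
```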
- You need to think carefully about how your regular expression is going to interact with the POS filter you have chosen. For example:
- If you type \bspr בספר\ and choose the ‘adv’ filter, the only thing it will allow is spr ספר.
- If you want complete control, choose 'string', and only what you cut out yourself will be cut out.
- Using all of these regular expressions in combination can be a very powerful way to design precise searches. For example:
- Searching for m?Ew?nyyN[^t|^wt] [מ?עו?ניין[^ת|^ות finds mEwnyyN מעוניין, mEnyyN מעניין, and EnyyN עניין in the masculine singular and plural.
- Using Hebrew Script in Regular Expressions
- As shown in the examples above, regular expressions work in Hebrew script as well as they do in English. Just remember expressions need to be entered on the opposite side of the character, since Hebrew is written from right to left.
- If you are unsure about where the special characters are found on the Hebrew or Hebrew QWERTY keyboards, it is possible to see the layout online. If you still cannot find the characters you need on your keyboard, you can open Character Palette on Macs (available under 'International'), or Character Map on Windows and enter the characters manually.
- NOTE: Once the results are displayed, most browsers display the backslashes and parentheses in bizarre and unpredictable ways. This has no effect on your search; it is just a problem with displaying Hebrew.
- Cutting Out Results by Hand
- Like searching with regular expressions, cutting out results by hand can be a powerful tool in refining your searches.
- Sometimes you know that the ambiguous morphology of a particular form is going to give you many false hits you do not want.
- To cut out forms that otherwise would be found by a search, type your search word or expression, a space, one or two dashes, a space, and then a word with or without regular expressions that indicates what you don't want. For example, spr -- mspr ספר -- מספר cuts out mspr מספר from the results.
- You can use a vertical bar to cut out multiple things.
- For example, if you search for all forms of the verb dybr,mdbr דיבר,מדבר, you will get many examples of dbr דבר and other undesired forms. To avoid this, you can type dybr,mdbr -- ^w?h|^w?dbr דיבר,מדבר -- ^ו?ה|^ו?דבר. If you perform this search as a verb on ynet, you will cut out 15,832 instances of: dbr דבר, dbry דברי, hmdbr המדבר, dbrw דברו, wdbr ודבר, wdbrw ודברו, wdbry ודברי, hmdbryM המדברים, hmdbrt המדברת, whmdbr והמדבר, hmdbrwt המדברות, and whmdbrt והמדברת (all of these unwanted words account for more than half of the results found for the particularly ambiguous dybr,mdbr דיבר,מדבר searched by itself).
- This does not remove all of the ambiguity, but it does cut out a significant amount and saves a lot of time. It is valuable for many reasons, one being that you can look at the 'words before/after' page without wondering if an undesired form is producing the results.
- It is also possible to cut out words or letters before or after the searched word, another effective way to reduce ambiguity or choose more selectively. For example, if you want to search for HwC חוץ as in 'outside', but not get results for 'apart from', you can perform this search: \bHwC \w+ -- m\w+ +בחוץ \ו+ -- מ\ו\ . This will cut out all instances in which m-מ follows HwC חוץ, such as the expression HwC mzh חוץ מזה. This will cut out more than half of the results.
- NOTE: Whenever you search for words next to each other, it throws off the results on the 'word forms' page. It also presents forms on the 'words before/after' page that you wouldn't expect because it is presenting words next to the whole usage.
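The cut-out syntax can be thought of as two steps: split the query at the dashes, then drop every found form that matches the exclusion expression. The sketch below is an assumption about that logic, not the site's code; split_query and cut_out are hypothetical names, and the list of found forms is made up from the examples above.

```python
import re


def split_query(query):
    """Split 'search -- exclusion' at one or two dashes surrounded by spaces."""
    parts = re.split(r" --? ", query, maxsplit=1)
    search = parts[0]
    exclusion = parts[1] if len(parts) > 1 else None
    return search, exclusion


def cut_out(forms, exclusion):
    """Drop every form in which the exclusion expression is found."""
    if exclusion is None:
        return list(forms)
    return [f for f in forms if not re.search(exclusion, f)]


search, exclusion = split_query("dybr,mdbr -- ^w?h|^w?dbr")
found = ["dybr", "mdbr", "dbr", "dbry", "hmdbr", "wdbr", "whmdbrt", "mdbrt"]
print(search, exclusion)
print(cut_out(found, exclusion))
```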
- Searching for Punctuation
- Many punctuation marks can be searched for as is, but some have corresponding symbols that are also used as regular expressions. The best and easiest way to overcome this is to put a backslash before them. This tells the program to look for the exact punctuation mark. For example, g\’wrg\’ '\ג\'ורג, Amr\. .\אמר, etc.
- Using the Advanced Search Page
- There is a separate page on this site for the advanced search. It is still being developed, but so far it has:
- The capability of including vowels in your search.
- The ability to narrow your results by choosing more specific verb forms, such as pual and hitpael.
- This page has its own set of instructions. To access the page, click 'advanced search' under the subcorpus drop-down box. The instructions will explain what all of the options are.
Results
- Basic Information about Results
- Most results come back after a few seconds' wait. Less common single words give you results in about 10 seconds. If the word is common or the subcorpus is large, it may take from 30 seconds to a minute to get the results (shorter common words searched as a string can sometimes take longer). For phrases, the amount of time varies.
- If you search for something extremely common (such as an ayin by itself or multiple common words), the program will sometimes abort that specific search, and go to a blank screen. A trick that often works for this is to hit the back button on your browser; this will often display the search results.
- As can be expected, the wait time is longer if you search all genres at once. In some cases the wait time is only slightly longer, and in other cases it is doubled or tripled.
- One good way to save your results is to copy what you want from the corpus, open an Excel document, select the rows you want to paste the information into, and choose Paste Special. Under Paste Special, choose to paste as Unicode text. This will create organized rows that you can save for later viewing or side-by-side comparison.
- Summary Page
- This page automatically comes up first after a search, and you can return to it at any time by clicking on 'summary'.
- This page gives the following summary information about your search:
- the word you searched for in transliterated characters
- the search string you typed in, in both scripts; this may include extra search strings that the program uses to try to predict alternations, as under 'verb'
- the subcorpus or database that you searched
- the time it took the search engine to perform the search (this will generally be less than the actual time experienced by you, since it does not include the time it takes for the server to receive your request or to serve the results back to you).
- the POS filter you chose
- the POS filter that was actually used
- the number of occurrences or "hits"
- what that number translates to in terms of words per 100,000 words of the subcorpus
- NOTE: The latter bit of information is useful for comparative purposes. The subcorpora are of vastly different sizes, so the actual number per subcorpus may be misleading; the number per 100,000 words is more easily compared.
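The per-100,000 figure is a simple normalization: hits divided by the size of the subcorpus, scaled to 100,000 words. A minimal sketch follows; the function name per_100k and the word counts are hypothetical and are not actual subcorpus sizes.

```python
def per_100k(hits, subcorpus_words):
    """Occurrences per 100,000 words of the subcorpus."""
    return hits * 100_000 / subcorpus_words


# The same raw count looks very different in subcorpora of different sizes.
small = per_100k(500, 2_000_000)    # hypothetical 2-million-word subcorpus
large = per_100k(500, 20_000_000)   # hypothetical 20-million-word subcorpus
print(small, large)
```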
- Citations Page
- Click on 'citations' in the dark blue bar to see the 10 words before and the 10 words after the "hits".
- By default, these citations are sorted by the word that appears directly before the word you searched for.
- Note that this sorting is done by the whole word not by the root, so spr ספר, hspr הספר, and whspr והספר are nowhere near each other.
- The word directly before is repeated at the left-hand side of the page under 'sort word' so you can quickly glance through the citations and notice patterns, collocations, etc.
- If you would rather see the citations sorted by the word directly after the "hits", click on the sentence near the top left of the page that reads 'sort by word after'.
- The citations are shown 100 at a time.
- If your search returns more than 100 citations, they will be organized into pages which you can access by clicking on the dark red numbers at the top.
- The 'subsection' column indicates which subsection of the subcorpus the example comes from; e.g. NEWS, etc.
- If you want to see the exact reference for the citation, click on the 'subsection' heading and it will change to 'reference'.
- The references give information about the specific source; most are helpful although some are numbered and basically incomprehensible to the user.
- If you want to see more context than 10 words before and 10 words after, click on the number to the left of the citation.
- This will bring up a separate window that displays a paragraph of surrounding context from the source it comes from (or the whole citation if there are no copyright issues).
- You may have to use your browser's search function (command + f on a Mac or control + f on a PC) and Hebrew script to find the word or expression you searched for, since it is not highlighted already.
- Subsections Page
- Click on 'subsections' to see the total occurrences for that word or expression in the various subsections of the subcorpus you searched. This can be useful to see where a word is generally used, such as kdwr כדור in SPOR. If you search using one of the combined subcorpora, 'subsections' will list the subcorpora rather than individual subsections.
- The results in this section are ordered from most frequent to least frequent in terms of the number of occurrences.
- The frequency per 100,000 for each subsection is also given; this is valuable because each subsection is of a different size, and this puts the word or expression into proper perspective for better comparison.
- Word Forms Page
- Click on 'word forms' to see the exact forms that your search found, ordered by frequency.
- Click on any word in the word form list and a separate window will open with the citations for only that word form.
- Examine this list to get important hints about normal usage.
- Examine this list to identify what kinds of false "hits" you are getting so you can work to eliminate them, either by changing the POS filter, using regular expressions, or cutting them out by hand.
- NOTE: After every search, it is strongly suggested that you examine this list before you take the rest of the results seriously.
- See if there are any forms you expected to find that are not there.
- See if there are any forms you did not expect to find that are there.
- See if you can figure out what it is about the expression you typed in (and about Hebrew morphology) that would have created any problems you encountered.
- Words Before/After Page
- Click on 'words before/after' to see a list of the common words directly before and directly after the "hit", ordered by frequency.
- Examine these lists to scope out the main usages and collocations of the word you searched for.
- Examine these lists to identify structures that you had not intended to search for that you want to cut out.
- Click on a word in the list to see the citations with only that particular word before or after.
- A separate window will open showing you just those citations.
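The 'words before/after' lists amount to counting the immediate neighbors of each hit and ordering them by frequency. The sketch below shows the idea on a made-up token sequence; words_before_after is a hypothetical name, and the site may count neighbors differently (for example, around multi-word hits).

```python
from collections import Counter


def words_before_after(tokens, hit):
    """Count the word directly before and directly after each occurrence of hit."""
    before, after = Counter(), Counter()
    for i, token in enumerate(tokens):
        if token != hit:
            continue
        if i > 0:
            before[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            after[tokens[i + 1]] += 1
    return before, after


tokens = "hwA ktb mktb whyA ktb mktb whM ktb spr".split()
before, after = words_before_after(tokens, "ktb")
print(before.most_common())
print(after.most_common())
```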
About Each Subcorpus
- Click the following link to download a rich-text document with a comprehensive list of word totals for the subcorpora (and their subsections) found on the site:
- About each subcorpus, in order of appearance (*according to Wikipedia):
- Arutz 7 is a written news website for an Israeli media network, which is identified with Religious Zionism and as the voice of the Israeli settlement movement.* This resource is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
- Beginning Newspapers is made up of two subsections: Yanshuf and ShaarLaMatchil. Both of these newspapers contain easy Hebrew. Although a limited amount of material from them is available here, it is a great resource to study voweling. This resource is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
- Erev Erev in Eilat is the first local newspaper in Israel. It was established in 1962.
- Haaretz is Israel’s oldest daily newspaper. It was founded in 1918. It is said to be Israel’s most influential newspaper and is mostly read by the intelligentsia and the political and economic elites of Israel. It is described as liberal and left wing.* The subcorpus of Haaretz that is from the years 1990-91 is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
- Maariv is a daily tabloid published in Israel. It is said to give a balanced representation of the diverse views that abound in Israeli society.*
- Raanana Shelanu is a local newspaper for the city of Raanana in the central part of Israel.
- TheMarker is an economic news website that offers news on hi-tech business, advertising, media, real estate, labor market, law, automobiles and transportation.* This resource is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
- Ynet is one of the most popular Israel news websites. It is owned and operated by Yediot Ahronot, but most of its content is original and published only on the website. In 2008, it was Israel’s most popular internet portal.*
- The Tanach is the Hebrew Bible, broken into the subsections Torah, Neviim, and Ketuvim, from which the acronym TaNaKh comes.*
- The Mishnah is the first major work of Rabbinic Judaism, and the first major written redaction of the Jewish oral traditions.* This text is provided courtesy of The Structured Mishnah, which is available online here.
- Early Fiction is a collection of fiction from 30 early Hebrew authors who lived during the Haskalah and Hebrew revival. These texts are provided courtesy of Project Ben-Yehuda, which is an internet site devoted to making works in the public domain freely available. To visit this site, click here.
- Modern Fiction-Orig is a collection of fiction written originally in Hebrew, as opposed to translated into Hebrew. Works are not included in their entirety, but rather as excerpts from several books. It is divided into subsections by year of publication.
- Modern Fiction-Tran is a collection of fiction translated into Hebrew, as opposed to written originally in Hebrew. It includes works from 24 different languages, which are separated into subsections.
- Movies is a compilation of subtitles from 59 movies that have come out in the last forty-five years or so. These movies are divided into subsections by genre.
- Spoken is a group of transcribed texts, from six distinct conversations (which coincide with the subsections): a meeting at a high-tech plant discussing personnel issues, a son telling his father about a trip he made to China and Mongolia, an informal car drive back from a wedding, a family going to have dinner and having it, a boy talking with his girlfriend, and a soldier and his commander discussing personal issues. Spoken is provided courtesy of CoSIH, or The Corpus of Spoken Israeli Hebrew. To read more about this project, click here.
- Tapuz Forums is a collection of largely informal discussion that comes from Tapuz, an Israeli web portal.* This resource is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
- Journals is a large collection of specialized periodicals and journals on a variety of topics, largely for an academic audience. Among the topics discussed are medicine, Jewish life, law, education, and society.
- Knesset is a collection of sessions from the Knesset, which is the legislature of Israel. This resource is provided courtesy of Mila - Knowledge Center For Processing Hebrew. To access their site, click here.
- Wikibooks is a collection of free-content textbooks and annotated texts that anyone can edit.* It is part of the Wikimedia Foundation, and is used here under the GNU Free Documentation License. To view the site where the original articles are found, click here.
- Wikinews is a collection of free-content news that anyone can edit. It differs from Wikipedia in that it is written in the format of news stories as opposed to encyclopedia articles.* It is part of the Wikimedia Foundation, and is used here under the GNU Free Documentation License. To view the site where the original articles are found, click here.
- Wikipedia is a collection of free-content encyclopedia articles that can be edited by anyone.* It is part of the Wikimedia Foundation, and is used here under the GNU Free Documentation License. To view the site where the original articles are found, click here.
- Wikiquote is a collection of free-content quotations from prominent people, books, films, and proverbs that can be edited even by visitors to the site.* It is part of the Wikimedia Foundation, and is used here under the GNU Free Documentation License. To view the site where the original articles are found, click here.
- Wikisource is a collection of free-content textual sources. Its library includes novels, non-fiction works, letters, speeches, constitutional and historical documents, laws, and a range of other documents.* It is part of the Wikimedia Foundation, and is used here under the GNU Free Documentation License. To view the site where the original articles are found, click here.
- A number of the newspaper subcorpora have four-letter abbreviations for their subsections. These are mostly self-explanatory, but for clarification they stand for the following:
- CULT: Culture (culture and the arts, etc.)
- INFO: Information (health, computers, help, encyclopedia, etc.)
- LAWS: Laws (crime, investigations, law, etc.)
- LIFE: Lifestyle (tourism, relationships, entertainment, food, etc.)
- MISC: Miscellaneous (opinions, new age, letters to the editor, etc.)
- NEWS: News (headlines, international, regional, etc.)
- RPTS: Reports (various reports)
- SPOR: Sports (various sports articles)
Download Document with Word Counts
Announcements
- 27 August 2010: The full tutorial is now available on the site, which will help users quickly learn the program and its uses.
- 22 July 2010: Part of the tutorial is now available online; more will be added soon. A group mailing list has also been created for those interested in staying informed about updates and helpful hints.
- 14 July 2010: Several more subcorpora have been added to hebrewCorpus, as well as a number of combined subcorpora.
- 20 June 2010: An advanced search has been added to the corpus that allows the user to search for words with vowels. See 'advanced transliteration help' for a key to the letters used for the vowels.
- 12 January 2010: The URL for hebrewCorpus has been changed from BYU to NMELRC.
- 7 December 2009: A rough version of the instructions to guide the use of hebrewCorpus has been posted. Feel free to e-mail me with any additional questions or suggestions. Also, a tutorial to complement the instructions is being constructed—check back later for updates.
- 3 December 2009: The English was fixed so it no longer displays the English with brackets around it and in a jumble of Hebrew and English.
- 20 November 2009: All of the subcorpora were combined into an additional searchable subcorpus called 'All Newspapers'.
- 11 November 2009: The corpus is being worked on to rid it of strange characters that aren't supposed to show up in the corpus (such as â¢, ¼Ö, and unconverted English characters).
- 19 October 2009: The corpus has been put online through BYU.
- 27 August 2009: Work has been started to convert the corpus data and create a beta version.
Tutorial
- Introductory Note
- The tutorial can complement your knowledge of the instructions, or it can be an effective way to quickly learn the uses of the program by getting hands-on practice. In each of these searches, it is assumed that you will use the 'latin chars' box in your searches, but the Hebrew is also provided in case you would prefer that. Good luck!
- First Search: Looking for a single word
- First, click 'instructions' to the right of the subcorpus drop-down box to pull up this tutorial in a separate window, so you can continue to read it as you search.
- Click on 'transliteration help' to see a chart of the transliteration system in a separate window.
- Type mwH מוח into the 'latin chars' box.
- Choose 'noun' from the 'part of speech' drop-down box.
- Choose 'Journals' from the subcorpus drop-down box.
- Click on 'submit'.
- Wait about 15 seconds, since bigger subcorpora take longer to search.
- Examine the 'summary of search results', and note the total number of occurrences, and how frequent this is in the subcorpus per 100,000 words (which is located right below the number of occurrences). The latter statistic is important, because it helps you to compare the subcorpora of different sizes.
- Click on 'citations' in the dark blue bar.
- As you can see, each example gives you the word in context with 10 words before and 10 after.
- Scroll down and look at a few of the citations. You will notice that the citations are organized by the word before; to make this easier to see, the word before is also shown under 'sort word'. Scroll back to the top.
- Note that there are 44 pages of results. Each page lists 100 results, so the 4,347 occurrences fill 44 pages.
- Click on one of the numbers at the left, and a new window will open with even more context, if there is more in the surrounding paragraph.
- Click on 'sort by word after' to sort the citations by the word after instead.
- Next, click on 'subsections' in the dark blue bar.
- Study the number of occurrences and the relative frequencies of this word in the various sections of Journals. MED:Neurology has by far the most occurrences, which can be expected given the subject matter. Note that even though this page is sorted by total number of occurrences, the relative frequency is what really shows which subsection uses the word the most. For example, even though EDU:Biology has more than three times as many occurrences of mwH as MED:Cardiology, the relative frequency shows that MED:Cardiology actually uses the word almost three times as often as EDU:Biology.
- Now click on 'word forms' in the dark blue bar.
- Notice the different forms in which this word is found: with the definite article, by itself, with prepositional prefixes, with pronoun endings, etc.
- Click on במוח to see the citations for just that word form in a separate tab. This is a good way to refine your search to only the results you want to see.
- Click on 'words before/after' in the dark blue bar.
- Examine the most common words that come before our search word and the most common words that come after. This is a great way to see collocates and figure out common idioms. Is this what you would have predicted?
- Click on שבץ to see 413 citations of it in a new window.
- Click on 'summary' in the dark blue bar to go back to the summary page.
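The two figures used in this first search, the relative frequency per 100,000 words and the number of result pages, can be checked with a short calculation. The sketch below is only an illustration in Python: the function names are hypothetical, and the subsection counts and sizes are made-up values, not figures taken from hebrewCorpus. Only the 4,347 results at 100 per page come from the tutorial itself.

```python
import math

def per_100k(occurrences, subcorpus_words):
    """Occurrences per 100,000 words (a relative frequency)."""
    return occurrences * 100_000 / subcorpus_words

def page_count(total_results, per_page=100):
    """Number of citation pages when each page lists `per_page` results."""
    return math.ceil(total_results / per_page)

# Hypothetical subsections as (occurrences, total words); not real corpus data.
big_section = (300, 3_000_000)
small_section = (90, 300_000)

big_rate = per_100k(*big_section)
small_rate = per_100k(*small_section)

# The larger section has more raw hits, but the smaller one uses the word
# more often relative to its size.
print(big_rate, small_rate)

# 4,347 results at 100 per page, as in the tutorial search.
pages = page_count(4347)
print(pages)
```

This is why the 'subsections' page is best read by relative frequency rather than by raw count when the subsections differ greatly in size.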
- Understanding the POS Filters
- Type zqN זקן into the 'latin chars' box, choose the 'adj' POS and the 'ErevErev: 03-09' subcorpus, and click 'submit'.
- Go through the various options in the dark blue bar as before. Note particularly under 'word forms' that you get examples of both masculine and feminine, singular and plural. You also get the definite article and the 'vav' prefix.
- Now try searching the same word in the same subcorpus, only with the 'noun' POS chosen. This time under 'word forms' you also get attached prepositions, whereas other results are cut out.
- Next, type sbbh סבבה into the 'latin chars' box, choose the 'adv' POS and the 'Tapuz Forums' subcorpus. Go through the various options once you get the results. Note that you now only get results for the bare word form and one instance of it with a vav.
- Type ktb כתב into the 'latin chars' box, choose the 'verb' POS and the 'Raanana' subcorpus, and go through the various options looking at the results. You will notice that the program generated other strings to go with this search: ktb,kwtb,ktwb כתב,כותב,כתוב. This is the system's way of trying to get all of the expected conjugations of a verb.
- Try other words and run them through each of the POS filters. Then look at word forms, and you will begin to understand the differences between all of the filters.
- Searching Subcorpora by Themselves or Together
- Type qwnsTlcyh קונסטלציה into the 'latin chars' box, choose the 'noun' POS and the 'Haaretz: 08' subcorpus, and click 'submit'.
- Only six results were found. Go through the various options; you will notice that it is hard to see significant trends because of the small sample size. 'Subsections' only lists three sections, and the relative frequency is very small.
- Now try the same word and POS, but this time choose >>ALL GENRES<<. This will search all of the subcorpora simultaneously, and since it is so big it will take a while to retrieve the results.
- About three minutes later, you will see that it found 78 occurrences in the entire corpus. Go through each of the options. Notice particularly that the 'subsections' page does not give the number of occurrences in each subsection, but in each subcorpus instead. This is also the case with the other combined subcorpora. It gives you an idea of which subcorpora this word occurs in, and which it does not (in which case the subcorpus does not show up in the results).
- Now try this with a new word. Search for Twb טוב as an 'adj' in 'Spoken', and click on 'submit'. This gets 48 occurrences. Go through the options and notice the trends.
- Search for this word in >>ALL GENRES<<. This word is common enough that it will cause the program to stall after a few minutes and present you with a blank screen. Click the 'back' button on your browser to retrieve the results. Go to 'subsections' and you will see interesting distributions of Twb טוב in the subcorpora. Is this what you would have guessed?
- Searching for More Than One Word at Once
- Try searching for yldh ילדה as a 'noun' in 'Journals'. Go to 'word forms'. The plural for this word, yldwt ילדות, is not there since the program does not account for this variant. To overcome this in one search, type yldh,yldwt ילדה,ילדות (with no space) in the 'latin chars' box. Now go to 'word forms', and you will see that both of these words are included in the results. Now try Eyr,EryM עיר,ערים and examine these results.
- Searching for Phrases
- Search for mSA wmtN משא ומתן as a 'string' in 'ErevErev:03-09'. Go to 'words before/after'. These words come before and after the entire phrase that you searched. Try other phrases, as strings and as other parts of speech, then check word forms to see what it produces. As another example, suppose you want to find rch רצה in the sense of "she runs" rather than "he wanted". You can reduce the ambiguity by searching for hyA rch היא רצה.
- Searching with Regular Expressions
- Learning to Represent Word Characters
- Type Any \w\w\wty אני \ו\ו\ותי into the 'latin chars' box as a 'string' in '--All Colloquial--'; \w \ו means any word character. Go to 'word forms'. This will give you all of the instances of Any אני followed by a verb in the past tense 1st person singular form in the three colloquial subcorpora.
- Try searching for \w?ASr ו?אשר\ as a 'noun' in 'Tanach'. This tells the program to find ASr אשר by itself or with any single letter before it, in addition to applying the noun filter. This will give you several results, including ASr, kASr, mASr (אשר, כאשר, מאשר) and other interesting forms.
- Type hwA\s?\w* rch הוא\ס?\ו* רצה as a 'string' in 'Ynet: 00-09'. This is telling the program to find hwA rch הוא רצה, as well as any words that come between the phrase, since \w* \ו* can mean any amount of word characters, including none. \s? ?ס\ means that there can be one or no spaces there, so the program does not have to find two spaces. With this search, you will find some interesting words between hwA הוא and rch רצה.
- Finally, type nwrA \w+ +נורא \ו as a 'string' in 'Modern Fiction-Orig: 05-10'. This is telling the program to find one or more instances of a word character after nwrA נורא and a space. Look through the different words that come after nwrA נורא here under 'word forms'.
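The four word-character searches above can be tried outside the corpus on transliterated strings. The following is a minimal sketch using Python's `re` module, on the assumption that hebrewCorpus treats `\w`, `\s`, `?`, `*`, and `+` the way Python does; the helper name `matches` and the sample phrases are illustrative and are not taken from the corpus.

```python
import re

def matches(pattern, text):
    """True if the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, sample text) pairs in the ht transliteration.
checks = [
    (r"Any \w\w\wty", "Any ktbty"),       # three word characters, then ty
    (r"Any \w\w\wty", "Any ktb"),         # no ty ending
    (r"\w?ASr", "ASr"),                   # optional character absent
    (r"\w?ASr", "kASr"),                  # optional character present
    (r"hwA\s?\w* rch", "hwA rch"),        # \w* matches nothing
    (r"hwA\s?\w* rch", "hwA lA rch"),     # \w* matches one word in between
    (r"hwA\s?\w* rch", "hwA kbr lA rch"), # two words in between
    (r"nwrA \w+", "nwrA Twb"),            # at least one word character
    (r"nwrA \w+", "nwrA "),               # nothing after the space
]

for pattern, text in checks:
    print(f"{pattern!r:22} {text!r:20} {matches(pattern, text)}")
```

Note that `\w*` cannot cross a space, so the third pattern allows at most one word between hwA and rch.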
- Learning to Represent Quantifiers
- Now type Sw?lHN שו?לחן as a 'noun' in '--All Literature--'. You will notice that this brings up two spellings of SwlHN שולחן: with the vav or without. Note that there is often a reason why it is spelled without a vav: sometimes it is in construct and sometimes it is from a source that has had its vowels stripped out. To view the source with vowels, go to 'advanced search'.
- Next, search for Ah+ +אח as an 'adv' in '--All Colloquial--'. What results does this give you?
- Type Ey*r עי*ר as an 'adj' in 'Haaretz: 08'. This will give you many forms of Eyr עיר and Eyyrh עיירה, as well as a number of other words.
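The three quantifiers used above (`?` for zero or one, `+` for one or more, `*` for zero or more) can be compared side by side. This is a small sketch with Python's `re` module, assuming the corpus follows the same quantifier rules; the helper name `matches` and the sample strings are illustrative only.

```python
import re

def matches(pattern, text):
    """True if the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# ? : the w may appear once or not at all.
with_or_without_vav = [s for s in ["SlHN", "SwlHN", "SwwlHN"]
                       if matches(r"Sw?lHN", s)]

# + : the h must appear at least once.
one_or_more_h = [s for s in ["A", "Ah", "Ahhh"] if matches(r"Ah+", s)]

# * : the y may appear any number of times, including zero.
any_number_of_y = [s for s in ["Er", "Eyr", "Eyyr"] if matches(r"Ey*r", s)]

# A longer word can still contain the pattern, which is one way a form
# such as Eyyrh can appear among the results.
contains_pattern = re.search(r"Ey*r", "Eyyrh") is not None

print(with_or_without_vav, one_or_more_h, any_number_of_y, contains_pattern)
```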
- Learning to Represent any Character
- Type byt.spr בית.ספר as a 'noun' in 'Wikipedia'. Note that a period matches any single character, including spaces and punctuation. Wikipedia is the largest subcorpus, so it will take about two minutes. Go to 'word forms', and you will find many forms of byt spr בית ספר, but also byt-spr בית-ספר.
- Now type wbkN.? ?.ובכן as an 'adv' in 'Journals'. This will give you wbkN ובכן, but also different forms of punctuation that come after it.
- Search for byN.+wbyN בין.+ובין as an 'adv' in 'Early Fiction'. Look through 'citations' and see what results it retrieved. Some of the results are exactly what you would expect, whereas others are slightly different.
- Type rwdF AHry.* *.רודף אחרי as a 'string' in '--All News--'. Now look at word forms and see what results there are.
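The behavior of the period in these four searches can be illustrated as follows. This is a sketch in Python, assuming the corpus treats `.` as Python's `re` module does (any single character; in Python this excludes a line break by default). The helper name and the sample strings are made up for illustration.

```python
import re

def matches(pattern, text):
    """True if the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# . stands for exactly one character of any kind.
single_dot = [s for s in ["byt spr", "byt-spr", "bytspr"]
              if matches(r"byt.spr", s)]

# .? stands for one optional character, such as a punctuation mark.
optional_dot = [s for s in ["wbkN", "wbkN,", "wbkN.."]
                if matches(r"wbkN.?", s)]

# .+ needs at least one character between the two parts.
dot_plus = [s for s in ["byN ySrAl wbyN", "byNwbyN"]
            if matches(r"byN.+wbyN", s)]

# .* accepts anything, including nothing, after the phrase.
dot_star = [s for s in ["rwdF AHry", "rwdF AHry hkbwd"]
            if matches(r"rwdF AHry.*", s)]

print(single_dot, optional_dot, dot_plus, dot_star)
```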
- Learning to Use Square Brackets
- Type mw[zs]yqh מו[זס]יקה as a 'noun' in 'Journals'. Square brackets indicate that more than one character can occupy that space. This will give you results for both mwzyqh מוזיקה and mwsyqh מוסיקה .
- Now type mw[^zs]yqh מו[^זס]יקה as the same POS in the same corpus. The caret before the z ז and s ס tells the program to find anything but these two letters. This will find a few occurrences of mwnyqh מוניקה and mwryqh מוריקה; note that the characters in that position are not z ז or s ס.
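The difference between a plain character set and a negated one can be seen on the same word list. This is a brief sketch with Python's `re` module, which is assumed here to handle square brackets the same way the corpus does; the word list simply reuses the forms mentioned above.

```python
import re

words = ["mwzyqh", "mwsyqh", "mwnyqh", "mwryqh"]

# [zs] accepts either z or s in that position.
either_spelling = [w for w in words if re.fullmatch(r"mw[zs]yqh", w)]

# [^zs] accepts any single character except z and s.
neither_spelling = [w for w in words if re.fullmatch(r"mw[^zs]yqh", w)]

print(either_spelling)
print(neither_spelling)
```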
- Learning to Use Parentheses
- Parentheses, similar to square brackets, are used to show alternate possibilities. For example, type Sn(h|yM|w?t) (שנ(ה|ים|ו?ת as an 'adv' in 'Raanana'. This will give you results for the singular, plural, and the construct form for both singular and plural.
- You can also use a question mark after the parentheses to include the word outside of it in the results as well. For example, search for ly?b(y|K|w|h|kM|kN|M|N|nw)? ?(לי?ב(י|ך|ו|ה|כם|כן|ם|ן|נו as an 'adv' in 'Journals'. This will give you examples of lb לב with the pronoun endings, but it will also give you lb לב in the results.
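Both parenthesized patterns can be tested against a handful of transliterated forms. The sketch below uses Python's `re` module and assumes the corpus treats `(a|b)` and a trailing `?` the same way; the sample forms are illustrative and were not pulled from the corpus.

```python
import re

def matches(pattern, text):
    """True if the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# One ending out of the listed alternatives is required.
year_pattern = r"Sn(h|yM|w?t)"
year_forms = [s for s in ["Snh", "SnyM", "Snt", "Snwt", "Sny"]
              if matches(year_pattern, s)]

# The ? after the group makes the whole ending optional,
# so the bare word is accepted as well.
heart_pattern = r"ly?b(y|K|w|h|kM|kN|M|N|nw)?"
heart_forms = [s for s in ["lb", "lby", "lbK", "lbkM", "lybnw", "lbT"]
               if matches(heart_pattern, s)]

print(year_forms)
print(heart_forms)
```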
- Learning to Represent Word Boundaries
- Even though 'adv' lets you limit what the program accepts, it does not let you limit it to only that word form. To do that, use \b ב\. For example, search for Edh עדה as an 'adv' in '--All News--'. This will give you 1,663 occurrences of wEdh ועדה in addition to the 278 instances of Edh עדה that you want. You can prevent this from happening by performing the same search as \bEdh בעדה\ as an 'adv', which marks the word boundary.
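The effect of the word boundary can be reproduced on a few forms. This is a small sketch with Python's `re` module, assuming `\b` in the corpus behaves as it does in Python (a position between a word character and a non-word character or the edge of the text); the word list is illustrative.

```python
import re

words = ["Edh", "wEdh", "hEdh"]

# Without a boundary, the pattern is found inside longer words too.
without_boundary = [w for w in words if re.search(r"Edh", w)]

# With \b in front, a prefix letter such as w blocks the match,
# because there is no boundary between two word characters.
with_boundary = [w for w in words if re.search(r"\bEdh", w)]

print(without_boundary)
print(with_boundary)
```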
- Learning to Search for the Beginning and End of Words
- You can also use ^ outside of square brackets to mark the beginning of words. Type ^AnTy אנטי^ as a 'string' in 'Journals'. This will give you all words starting with AnTy אנטי.
- In addition, $ outside of square brackets marks the end of a word. Type lwgyh$ $לוגיה as a 'string' in 'Journals'. This will give you all of the words ending in lwgyh לוגיה.
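The two anchors can be illustrated on a short word list. One difference is worth stating: in Python's `re` module, `^` and `$` mark the start and end of the whole text rather than of each word, so this sketch applies the pattern to one word at a time to imitate the behavior described above. The sample words are illustrative.

```python
import re

words = ["AnTybywTy", "bywlwgyh", "psykwlwgyh", "lwgyqh"]

# ^ requires the pattern at the very start of the word.
starting_with_anty = [w for w in words if re.search(r"^AnTy", w)]

# $ requires the pattern at the very end of the word.
ending_in_lwgyh = [w for w in words if re.search(r"lwgyh$", w)]

print(starting_with_anty)
print(ending_in_lwgyh)
```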
- Searching for Punctuation
- In order to search for punctuation, it is necessary to put a backslash before it. As an example, type zh\. .\זה to find all examples of zh זה with a period after it. Do not type zh. .זה (without the backslash) or you will get results of zh זה with any character after it. Try this with other forms of punctuation.
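The difference between an escaped and an unescaped period is easy to see on two strings. This is a minimal Python sketch, assuming the corpus follows the usual regular expression rule for the backslash; `re.escape` is a Python convenience and is not something the corpus interface is known to provide.

```python
import re

# Escaped: the period must literally be a period.
escaped_hits = [s for s in ["zh.", "zhw"] if re.fullmatch(r"zh\.", s)]

# Unescaped: the period stands for any single character.
unescaped_hits = [s for s in ["zh.", "zhw"] if re.fullmatch(r"zh.", s)]

# In Python, re.escape adds the backslashes for you.
escaped_pattern = re.escape("zh.")

print(escaped_hits, unescaped_hits, escaped_pattern)
```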
- Cutting Out Results by Hand
- Type b ב as a 'noun' in 'Modern Fiction-Tran:05-10'. Notice that you get the forms of the preposition b ב that you are looking for, but you also get byN בין, bN בן, etc., which you do not want. To cut these forms out, search for b -- (byN|bN) as the same POS in the same subcorpus. This will only cut out the exact forms byN בין and bN בן. To cut out more forms, you will have to include more in the search or use regular expressions to list more.
- As a further example, suppose that you want to search for byt בית in phrases other than byt spr בית ספר. To do this, search for byt \w+ -- spr בית \ו+ -- ספר. This will give you all results except for byt spr בית ספר.
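The idea behind the `--` operator can be pictured as a two-step filter: first collect everything that fits the part before `--`, then drop the hits that contain a word fitting the part after it. The Python sketch below is only one plausible reading of that behavior. The function `filter_hits`, the stand-in pattern `b\w*` for the filtered search on b, and the sample hits are all assumptions made for illustration; they do not describe how hebrewCorpus is implemented.

```python
import re

def filter_hits(hits, include, exclude):
    """Keep hits that fit `include` unless one of their words fits `exclude`.

    Hypothetical model of the `include -- exclude` search syntax.
    """
    kept = []
    for hit in hits:
        if not re.fullmatch(include, hit):
            continue
        if any(re.fullmatch(exclude, word) for word in hit.split()):
            continue
        kept.append(hit)
    return kept

# Modeled on: b -- (byN|bN)
prefix_hits = filter_hits(["b", "byN", "bN", "bkl"], r"b\w*", r"(byN|bN)")

# Modeled on: byt \w+ -- spr
phrase_hits = filter_hits(["byt spr", "byt knst", "byt mSpT"],
                          r"byt \w+", r"spr")

print(prefix_hits)
print(phrase_hits)
```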
- Using Advanced Search
- Click on 'advanced search' under the 'submit' button. On that page, click on 'instructions'. This will give you instructions you can follow as you use the search.
- Now click 'search by hand'. This is the only fully usable advanced search so far. Try searching for \wx\wx\w as an 'adv' in 'Tanach'. This will give you quite a few different segholate nouns that occur in the Hebrew Bible.
- You can search for many other words and constructions using this search. Just remember that you need to be flexible, and include the possibility of different orders (for example, the dagesh might come before a vowel, or the vowel before the dagesh). You should also put a question mark after the vowel if you want the program to find the word unvoweled along with voweled.
Questions/Problems
- What are some common errors to avoid?
- Typing the noun with an article, unless you mean to limit your search (the tool automatically looks for nouns and adjectives with and without the article; if you type it with the article it won't find any examples without it).
- Choosing the wrong POS filter (looking for an adjective using the 'noun' POS, looking for all instances of a word using the 'adv' POS, etc.).
- Choosing 'noun' when looking for a phrase (you should usually choose 'adv' or 'string').
- Why do I want to filter my results?
- If you are trying to find examples of a particular word or construction, and the program finds hundreds of results of something else that happens to match what you are looking for (because of the morphological ambiguity of Hebrew), it can be very time consuming and annoying to search through the citations one by one to find the ones you really want. If you can figure out a way to filter out the bad ones, you can save yourself many hours.
- How do I take care of filtering myself, and bypass the POS filters?
- Choose 'string' and type a regular expression that matches what the POS filters do, or that varies them. For example:
- \bSqr\b בשקר\ב\ will find ONLY the word Sqr שקר with no prefixes or suffixes.
- \b[bl]?byt\b ב[בל]?בית\ב\ allows byt בית, bbyt בבית, lbyt לבית, and nothing else.
- \b[mS]?h?AH\b ב[מש]?ה?אח\ב\ allows AH אח, hAH האח, mAH מאח, mhAH מהאח, SAH שאח, and ShAH שהאח, and nothing else.
- \b[mS]?h?AHy?(K|w|h)?\b ב[מש]?ה?אחי?(ך|ו|ה)?\ב\ allows all of the above and the singular pronoun endings.
- It is important to remember that the program will find only the forms you enter in this way, but if there is a punctuation mark without a space between it and another word, those marks will show up in the results as well. For example, in the search for \b[bl]?byt\b ב[בל]?בית\ב\, mSq-byt משק-בית would show up in the results. You will also find a lot of typos in the papers this way.
- The program can deal with complex expressions efficiently, so you can get quite a bit of control over what you are searching for if you want it.
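The hand-built filters listed above can be checked against short transliterated samples, including the mSq-byt case where a hyphen lets an unwanted form through. This is a sketch using Python's `re` module, assuming the corpus applies `\b`, `[...]`, `?`, and `(a|b)` the same way; the helper `whole_matches` and the sample texts are illustrative only.

```python
import re

def whole_matches(pattern, text):
    """Return every stretch of text that the pattern matches."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Only the bare word, with no prefixes or suffixes.
bare = whole_matches(r"\bSqr\b", "Sqr hSqr Sqry")

# byt with an optional b or l prefix. The hyphen in mSq-byt counts as a
# word boundary, so the byt after it is matched as well.
house = whole_matches(r"\b[bl]?byt\b", "hbyt bbyt lbyt mSq-byt")

# AH with optional m or S, then optional h.
brother = whole_matches(r"\b[mS]?h?AH\b",
                        "AH hAH mAH mhAH SAH ShAH wAH AHy")

# The same, plus an optional y and a singular pronoun ending.
brother_suffix = whole_matches(r"\b[mS]?h?AHy?(K|w|h)?\b",
                               "AHy AHyK AHyw AHyh AHwt")

print(bare, house, brother, brother_suffix, sep="\n")
```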
- Are there copyright issues in making these texts available?
- Some of these texts are in the public domain, and in that case the whole citation is available for viewing. Others, however, are copyrighted and can only be viewed partially, as a paragraph in context. This makes them legal under the Fair Use law, since they are free to the public, for academic purposes, and only presented in small amounts. For more on this law, click here.
- What browser is recommended for this site?
- The Firefox browser is recommended, although you should be able to use this program with other browsers as well.
- Why do characters sometimes show up incorrectly?
- Due to the right-to-left nature of Hebrew, sometimes when numbers or special characters are next to other characters they can be displayed in an unusual way.
- Other concerns not listed here
- If your concern is not listed here, feel free to contact Justin Parry at justin.parry@mail.utexas.edu with any additional questions. A link is provided at the bottom of the website which will send you directly to mail.
Site maintained by the College of Humanities. For other questions or suggestions, please contact Justin Parry