German Stop Words

24th January 2012·1 min read

Hey all,

I’m doing some text mining in the last time, so I needed a reliable list of german stop words.
The only real advanced version I have found was the lucene ”GermanAnalyzer”. That is the seed of the following list I wanted to share with you.

I already formatted this as an array that is put into a HashSet, so you can easily use it within your Java code via HashSet#contains(token).

public final static HashSet<String> GERMAN_STOP_WORDS = new HashSet<String>(  
     Arrays.asList(new String[] { "and", "the", "of", "to", "einer",  
      "eine", "eines", "einem", "einen", "der", "die", "das",  
      "dass", "daß", "du", "er", "sie", "es", "was", "wer",  
      "wie", "wir", "und", "oder", "ohne", "mit", "am", "im",  
      "in", "aus", "auf", "ist", "sein", "war", "wird", "ihr",  
      "ihre", "ihres", "ihnen", "ihrer", "als", "für", "von",  
      "mit", "dich", "dir", "mich", "mir", "mein", "sein",  
      "kein", "durch", "wegen", "wird", "sich", "bei", "beim",  
      "noch", "den", "dem", "zu", "zur", "zum", "auf", "ein",  
      "auch", "werden", "an", "des", "sein", "sind", "vor",  
      "nicht", "sehr", "um", "unsere", "ohne", "so", "da", "nur",  
      "diese", "dieser", "diesem", "dieses", "nach", "über",  
      "mehr", "hat", "bis", "uns", "unser", "unserer", "unserem",  
      "unsers", "euch", "euers", "euer", "eurem", "ihr", "ihres",  
      "ihrer", "ihrem", "alle", "vom" }));

Note that there are some english words as well, if you don’t need them, they are just in the first section of the array. So you can easily remove them ;)

If you have a good stemmer, you can remove other words as well.

How did I extract them?

These words are the words that had the highest word frequency in a large set (> 10 Mio.) of text and html documents.

Have fun and good luck!