German Stop Words

24th January 2012

Hey all,

I’ve been doing some text mining lately, so I needed a reliable list of German stop words.
The only really advanced list I found was the one in Lucene's "GermanAnalyzer". It is the seed of the following list, which I wanted to share with you.

I already formatted this as an array that is put into a HashSet, so you can easily use it within your Java code via HashSet#contains(token).

import java.util.Arrays;
import java.util.HashSet;

// Declare this field inside any class of your project.
public static final HashSet<String> GERMAN_STOP_WORDS = new HashSet<String>(
     Arrays.asList(
      // English words, delete this line if you don't need them
      "and", "the", "of", "to",
      // German words (duplicate entries removed, the set ignores them anyway)
      "einer", "eine", "eines", "einem", "einen", "der", "die", "das",
      "dass", "daß", "du", "er", "sie", "es", "was", "wer",
      "wie", "wir", "und", "oder", "ohne", "mit", "am", "im",
      "in", "aus", "auf", "ist", "sein", "war", "wird", "ihr",
      "ihre", "ihres", "ihnen", "ihrer", "als", "für", "von",
      "dich", "dir", "mich", "mir", "mein",
      "kein", "durch", "wegen", "sich", "bei", "beim",
      "noch", "den", "dem", "zu", "zur", "zum", "ein",
      "auch", "werden", "an", "des", "sind", "vor",
      "nicht", "sehr", "um", "unsere", "so", "da", "nur",
      "diese", "dieser", "diesem", "dieses", "nach", "über",
      "mehr", "hat", "bis", "uns", "unser", "unserer", "unserem",
      "unsers", "euch", "euers", "euer", "eurem",
      "ihrem", "alle", "vom"));
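To show how the set can be used for filtering, here is a minimal sketch. The class name StopWordFilter, the method removeStopWords and the shortened STOP_WORDS set are made up for this example (in real code you would pass in GERMAN_STOP_WORDS), and the tokenization is a simple assumption: lower-case the text and split on everything that is not a letter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordFilter {

    // Shortened stand-in for GERMAN_STOP_WORDS so the example is self-contained.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("der", "die", "das", "und", "ist", "in", "auf"));

    // Returns the tokens of the text that are not stop words, in original order.
    static List<String> removeStopWords(String text) {
        List<String> kept = new ArrayList<>();
        // Split on anything that is not a Unicode letter, so umlauts and sharp s survive.
        for (String token : text.toLowerCase(Locale.GERMAN).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // prints [hund, wiese, katze, stadt]
        System.out.println(removeStopWords("Der Hund ist auf der Wiese und die Katze in der Stadt."));
    }
}
```

Lower-casing before the lookup matters here, because the list only contains lower-case entries and German capitalizes words at the start of a sentence.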

Note that there are some English words as well. If you don’t need them, they are all at the very start of the array, so you can easily remove them ;)
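If you would rather not edit the array by hand, the English entries can also be dropped at runtime. This is only a small sketch: the class GermanOnlyStopWords and the method withoutEnglish are hypothetical names, and the four English words are the ones from the list above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class GermanOnlyStopWords {

    // The four English entries from the start of the array.
    static final Set<String> ENGLISH = new HashSet<>(Arrays.asList("and", "the", "of", "to"));

    // Returns a copy without the English entries; the input set is left untouched.
    static Set<String> withoutEnglish(Set<String> stopWords) {
        Set<String> copy = new HashSet<>(stopWords);
        copy.removeAll(ENGLISH);
        return copy;
    }

    public static void main(String[] args) {
        Set<String> mixed = new HashSet<>(Arrays.asList("and", "the", "der", "die"));
        // Only "der" and "die" remain (a HashSet has no guaranteed order).
        System.out.println(withoutEnglish(mixed));
    }
}
```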

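If you apply a good stemmer before the lookup, you can remove further words as well, since inflected forms such as "eine", "eines" and "einem" may then collapse to a single stem.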

How did I extract them?

These are the words that had the highest frequency in a large set (more than 10 million) of text and HTML documents.
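The idea of picking the most frequent words can be sketched roughly like this. It is not the code I used back then, just an illustration under simple assumptions: the documents are already plain text (no HTML stripping), tokens are lower-cased letter sequences, and the names StopWordCandidates, countWords and topWords are invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class StopWordCandidates {

    // Counts how often each lower-cased token occurs across all documents.
    static Map<String, Integer> countWords(List<String> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            for (String token : doc.toLowerCase(Locale.GERMAN).split("[^\\p{L}]+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Returns the n most frequent words; ties are broken alphabetically.
    static List<String> topWords(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        docs.add("der Hund und der Ball");
        docs.add("der Ball und die Katze");
        // prints [der, ball, und]
        System.out.println(topWords(countWords(docs), 3));
    }
}
```

For a really large corpus you would not keep everything in one in-memory map like this, but the counting and sorting logic stays the same. The resulting top list still needs a manual look, since frequent content words ("ball" in the toy example) end up in it too.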

Have fun and good luck!


Thomas Jungblut
