This assignment is designed to help you learn to use a data mining tool called RapidMiner. Its
drag-and-drop interface makes it easy to build simple but powerful data mining models. This exercise
will teach you to create a k-Means clustering model, and then you will explore RapidMiner
more on your own.
To begin, open a web browser and navigate to http://rapidminer.com/download-rapidminer/
Select the download of RapidMiner Studio 7 that is appropriate to your operating system. You should
not need to sign up or create an account. Once the download is complete, launch the installer and
follow the installation wizard.
When you have completed the installation, you will see an introductory screen:
Click the New Process button on the left, then click on Blank.
You should now see the RapidMiner process design window.
Open Notepad or another text editor and copy and paste the following data into the document:
fname, lname, grade, tardies, absences, detention, suspension
Joe, Jensen, 12, 2, 1, 0, 0
Jill, Francis, 11, 3, 4, 1, 1
Marcus, Link, 11, 3, 2, 0, 0
Tyson, Bann, 10, 6, 3, 1, 1
Sammie, Kerr, 12, 7, 5, 2, 2
Jordan, Arndt, 11, 1, 1, 0, 0
Ezra, Lim, 12, 4, 3, 2, 2
Jacob, Jens, 12, 1, 2, 0, 0
Heather, Finner, 10, 1, 2, 0, 0
Save your text document as students.csv. CSV stands for Comma Separated Values, a very common
format for sharing, importing and exporting data between systems. Close the text document and return
to RapidMiner. In the search box in the Operators tab on the left, type Read CSV. Drag the Read CSV
operator over to the Main Process window.
In the Parameters tab on the right, click the folder icon, and navigate to the location where you saved
students.csv. Double click students.csv to add this to the Parameters window and close the dialog box.
In the column separators box (still under the Parameters tab), change the semicolon (as seen in the
screenshot above) to a comma, as shown here:
Connect a spline between the ‘out’ port on the Read CSV operator and the ‘res’ port on the left side of
the Process window. To make splines, simply click on one port, then click on the one you want to
connect it to. Run the model using the blue triangle icon.
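For readers who want to see what the Read CSV step does outside RapidMiner, the following is a minimal Python sketch using only the standard library. It assumes the file contents listed above; the names `STUDENTS_CSV` and `read_students` are illustrative, and `skipinitialspace=True` is used because each comma in the sample data is followed by a space.

```python
import csv
import io

# The same text saved earlier as students.csv, inlined so the sketch runs on its own.
STUDENTS_CSV = """fname, lname, grade, tardies, absences, detention, suspension
Joe, Jensen, 12, 2, 1, 0, 0
Jill, Francis, 11, 3, 4, 1, 1
Marcus, Link, 11, 3, 2, 0, 0
Tyson, Bann, 10, 6, 3, 1, 1
Sammie, Kerr, 12, 7, 5, 2, 2
Jordan, Arndt, 11, 1, 1, 0, 0
Ezra, Lim, 12, 4, 3, 2, 2
Jacob, Jens, 12, 1, 2, 0, 0
Heather, Finner, 10, 1, 2, 0, 0
"""


def read_students(text):
    """Parse comma-separated text into a list of dicts keyed by column name."""
    # delimiter="," plays the role of the 'column separators' setting in Read CSV.
    reader = csv.DictReader(io.StringIO(text), delimiter=",", skipinitialspace=True)
    return list(reader)


rows = read_students(STUDENTS_CSV)
print(len(rows), "records; columns:", list(rows[0].keys()))
```

Every value is still a string at this point; deciding which columns are numeric is a separate step.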
You will now see your CSV file display in Results view:
You can explore your data if you would like using the icons on the left of the screen. When finished, click
the Design button to switch back to your Process design. In the search box under the Operators tab,
type Select Attributes.
The k-Means operator creates groups, or clusters, of similar records in a dataset by calculating the
averages for each variable in the dataset and then classifying each record according to the set of averages
that the record is most similar to. In order to use it, all variables in the dataset must be numeric (it is
impossible to calculate the average for columns that contain first names or last names). We will need to
exclude non-numeric variables, and the Select Attributes operator will help us do that. Drag the Select
Attributes operator into the Process window, and situate it between Read CSV and the ‘res’ port on the
right side. Connect the ‘out’ port on Read CSV to the left-side ‘exa’ port on Select Attributes, then
connect the right-side ‘exa’ port on Select Attributes to a ‘res’ port.
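The idea behind k-Means described above can be sketched in a few lines of plain Python. This is a simplified illustration, not RapidMiner's implementation: it assumes squared Euclidean distance as the similarity measure, uses the first k records as the starting averages, and runs on made-up two-column records rather than the student data. All function and variable names are hypothetical.

```python
def squared_distance(a, b):
    """Sum of squared differences between two records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Column-by-column average of a list of records."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def k_means(records, k, max_iterations=100):
    # Assumption for this sketch: start from the first k records.
    centroids = [tuple(record) for record in records[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assign each record to the cluster whose averages it is closest to.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(record, centroids[c]))
            for record in records
        ]
        if new_labels == labels:
            break  # nothing moved, so the clusters are stable
        labels = new_labels
        # Recalculate the averages of each cluster from its current members.
        for c in range(k):
            members = [r for r, label in zip(records, labels) if label == c]
            if members:
                centroids[c] = mean_point(members)
    return labels, centroids


# Made-up (tardies, absences) pairs, used only to exercise the sketch.
toy_records = [(1, 1), (2, 1), (6, 5), (7, 5), (1, 2), (6, 4)]
labels, centroids = k_means(toy_records, k=2)
print(labels)
print(centroids)
```

Because the outcome of k-Means can depend on the starting averages, a different starting choice may produce different clusters on the same data.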
Next you need to configure the Select Attributes operator to keep only the numeric variables in the
dataset. Be sure you have Select Attributes selected (click on it), and in the Parameters tab on the right
side of the RapidMiner window, click the dropdown menu labeled ‘attribute filter type’. Select
Within the Value Type dropdown box, select ‘Integer’. This will eliminate all variables from the dataset
that are not numeric data types. The only ones remaining will be variables for which a mean (or average)
can be calculated.
Test your model by running it using the blue triangle icon. Your model should look like the first
screenshot below, and your results should look like the second screenshot.
Notice that the First Name and Last Name attributes are now gone from the data set, because their
value data type was not Integer.
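The effect of filtering by an Integer value type can be approximated in Python as follows. This is a rough sketch under stated assumptions: a column counts as integer only if every value in it parses with `int()`, and the three sample records and the helper names `is_integer` and `select_integer_columns` are illustrative only.

```python
def is_integer(text):
    """Return True if the text can be read as a whole number."""
    try:
        int(text)
        return True
    except ValueError:
        return False


def select_integer_columns(rows):
    """Keep only the columns in which every value is an integer."""
    names = [name for name in rows[0] if all(is_integer(row[name]) for row in rows)]
    return [{name: int(row[name]) for name in names} for row in rows]


# Three records from students.csv, with values as strings the way a CSV reader returns them.
sample_rows = [
    {"fname": "Joe", "lname": "Jensen", "grade": "12", "tardies": "2",
     "absences": "1", "detention": "0", "suspension": "0"},
    {"fname": "Jill", "lname": "Francis", "grade": "11", "tardies": "3",
     "absences": "4", "detention": "1", "suspension": "1"},
    {"fname": "Marcus", "lname": "Link", "grade": "11", "tardies": "3",
     "absences": "2", "detention": "0", "suspension": "0"},
]

numeric_rows = select_integer_columns(sample_rows)
print(list(numeric_rows[0].keys()))
```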
Switch back to the process design view by clicking the Design button. In the search box under the
Operators tab, type in k-Means (don’t forget the hyphen).
Drag and drop the k-Means operator into your analysis stream after the Select Attributes operator. If
you drag the operator into the Process window and it does not connect into your stream, you can
reconfigure your splines manually. Click the ‘out’ port on the Read CSV operator and then click the ‘exa’
port on the left side of the Select Attributes operator. Click the ‘exa’ port on the right side of the Select Attributes
operator, then click the ‘exa’ port on the left side of the Clustering operator (the k-Means operator is labeled Clustering in the Process window). Then click the ‘clu’ port on
the right side of the Clustering operator and connect it to a ‘res’ port on the right side of the Process
window. (Notes: ‘exa’ = exampleset; ‘clu’ = cluster; ‘res’ = resultset.) Your model should now look like
Run the model using the blue triangle icon. You will see in Results view that you are presented with two
clusters. The data are divided into the two clusters, based on the similarity of the means of each
variable. Click on the Folder View icon on the left side. Expand the trees by clicking on the plus signs to
reveal which records are grouped into each of the two clusters:
In examining the above, we can see that rows 2, 4, 5, and 7 are grouped together, while rows 1, 3, 6, 8, and 9
are grouped with one another due to similar means. If you look back at the data listed in step 8, you see
that Jill, Tyson, Sammie and Ezra correspond to the first group of numbers, while the other five students
correspond to the second group. The behaviors of the first group, on average, are similar to one another,
and the behaviors of the second group, on average, are similar to one another. Now click on the
Centroid Table icon on the left side of the screen.
The centroids are the averages that were calculated for each variable in each of the clusters. In
examining these centroids, which cluster do you think contains the records representing students who
are most at risk for trouble, and may need some extra care? Note that the grade a student is in
is probably not relevant, but the other indicators of behavioral trouble all have higher averages in
cluster_0 than in cluster_1. Thus, Jill, Tyson, Sammie and Ezra are probably the students who will benefit
from extra care and attention.
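To check the centroid averages by hand, the sketch below recomputes them from the cluster membership reported above (rows 2, 4, 5, and 7 in one group, the remaining rows in the other). The cluster labels follow the text, which associates the higher averages with cluster_0; the function name `centroid_table` and the data layout are assumptions made for illustration.

```python
COLUMNS = ["grade", "tardies", "absences", "detention", "suspension"]

# Numeric columns of students.csv, in file order.
RECORDS = [
    (12, 2, 1, 0, 0),  # row 1, Joe
    (11, 3, 4, 1, 1),  # row 2, Jill
    (11, 3, 2, 0, 0),  # row 3, Marcus
    (10, 6, 3, 1, 1),  # row 4, Tyson
    (12, 7, 5, 2, 2),  # row 5, Sammie
    (11, 1, 1, 0, 0),  # row 6, Jordan
    (12, 4, 3, 2, 2),  # row 7, Ezra
    (12, 1, 2, 0, 0),  # row 8, Jacob
    (10, 1, 2, 0, 0),  # row 9, Heather
]

# Cluster membership as reported in the text (1-based row numbers).
CLUSTERS = {"cluster_0": [2, 4, 5, 7], "cluster_1": [1, 3, 6, 8, 9]}


def centroid_table(records, clusters, columns):
    """Average every column over the members of each cluster."""
    table = {}
    for name, row_numbers in clusters.items():
        members = [records[number - 1] for number in row_numbers]
        table[name] = {
            column: sum(member[i] for member in members) / len(members)
            for i, column in enumerate(columns)
        }
    return table


table = centroid_table(RECORDS, CLUSTERS, COLUMNS)
for name, averages in table.items():
    print(name, averages)
```

The grade averages of the two groups come out nearly equal, while the tardies, absences, detention, and suspension averages differ clearly between them.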
Complete the following steps in a Word document and submit it to Moodle by the end of the module.
Switch back to the process design view and use the Parameters tab to generate three clusters instead of
two (hint: the k parameter sets the number of clusters). If the Parameters tab isn’t showing the right
options, click on the Clustering operator, and the Parameters tab will reconfigure to show you the
correct options. Examine your three clusters in Results view (use Folder View and Centroid Table to
understand your clusters). Write answers to the following questions in your Word document:
What can you discern about your at-risk student groups?
Who appears to need the most attention?
Who are the least likely to need additional care?
Take a screenshot of your k-Means Clustering results and paste it into your Word document.
Return to the Process design view. Delete your Clustering operator, then search for and add a
Correlation Matrix operator where the Clustering operator had been. Be sure to connect the ‘mat’ port
to a ‘res’ port, to get the correlation matrix. Click the blue triangle icon to run your Correlation Matrix
model. Note that values close to 1 (or -1, for an inverse relationship) indicate strong correlations, while values
close to 0 indicate little or no correlation. Answer the following questions in your Word document:
What attributes of the data set appear to strongly impact other attributes?
What attribute doesn’t appear to have much influence on the other attributes? What might explain
the lack of correlations?
How do you explain the relationship between the ‘detention’ attribute and the ‘suspension’
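As a cross-check on the Correlation Matrix step, the sketch below computes pairwise Pearson correlation coefficients in plain Python. It assumes the Pearson coefficient, which ranges from -1 to 1, is the measure of interest; the helper names `pearson` and `correlation_matrix` are illustrative, and the column values are copied from students.csv.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length number lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    # Undefined if either column is constant (division by zero).
    return covariance / math.sqrt(spread_x * spread_y)


def correlation_matrix(columns):
    """Correlate every column with every other column."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Numeric columns of students.csv, in file order.
STUDENT_COLUMNS = {
    "grade": [12, 11, 11, 10, 12, 11, 12, 12, 10],
    "tardies": [2, 3, 3, 6, 7, 1, 4, 1, 1],
    "absences": [1, 4, 2, 3, 5, 1, 3, 2, 2],
    "detention": [0, 1, 0, 1, 2, 0, 2, 0, 0],
    "suspension": [0, 1, 0, 1, 2, 0, 2, 0, 0],
}

matrix = correlation_matrix(STUDENT_COLUMNS)
for name, row in matrix.items():
    print(name, [round(value, 2) for value in row.values()])
```

Each attribute correlates perfectly with itself, so the diagonal of the matrix is always 1.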