Chapter 3 – Working with Batches of Data

Solutions to exercises in Chapter 3 of Accelerated C++, “Working with batches of data.”

Exercise 3-1

Suppose we wish to find the median of a collection of values. Assume that we have read some of the values so far, and that we have no idea how many values remain to be read. Prove that we cannot afford to discard any of the values that we have read. Hint: One proof strategy is to assume that we can discard a value, and then find values for the unread – and therefore unknown – part of our collection that would cause the median to be the value that we discarded.

expand/collapse icon Hint

There is more than one way to prove this. Consider the definition of median and the steps described in the chapter to calculate it.

expand/collapse icon Solution

Given the following:

  • The median is the numeric value separating the higher half of a set of values from the lower half.
  • The set of values must be sorted (e.g., arranged from lowest value to highest value) in order to determine the median.
  • The midpoint is determined by examining the complete set of sorted values.

It follows that if any previously read items in the set are discarded, and a value is read that changes the midpoint of the set to a number that was previously discarded, the midpoint cannot be accurately calculated.

For example, we will assume a strategy of retaining the middle three numbers of an odd number of inputs and the middle two numbers of an even number of inputs. New inputs will be sorted into the retained set.

The numbers 1, 3, and 5 are read, and the midpoint of this set is determined to be 3. The next number read is 15, so the median may be calculated as the average of 3 and 5 (4). The values 3 and 5 are retained, and the first and last values (1 and 15) are discarded. The next number read is 1, so the new midpoint is assumed to be 3. The numbers 1, 3, and 5 are retained. If the final numbers read are 29, 40, and 33, then the median of the complete set should be 15. This number, however, has been discarded, and cannot be calculated from the numbers that have been retained.

Exercise 3-2

Write a program to compute and print the quartiles (that is, the quarter of the numbers with the largest values, the next highest quarter, and so on) of a set of integers.

expand/collapse icon Hint

Calculate the median on the entire set of integers, then calculate the median of the lower half and the median of the upper half to obtain the quartiles. Also, consider that even though the input values are integers, the median calculations may contain fractional values.

expand/collapse icon Solution

This solution reads integers from the user, but the integers are placed into a vector of double values. The quartile calculations are doubles as well, since a quartile may be the average of two integers such as 2 and 3, yielding a floating-point result of 2.5.

The calculation of the lower and upper quartiles is a little more tricky than the calculation of the median because there are four situations that need to be handled:

  • An odd number of total values with an odd number of values above and below the midpoint.
  • An odd number of total values with an even number of values above and below the midpoint.
  • An even number of total values with an odd number of values above and below the midpoint.
  • An even number of total values with an even number of values above and below the midpoint.

When executed, the program gives the following output:

Exercise 3-3

Write a program to count how many times each distinct word appears in its input.

expand/collapse icon Hint

Use a vector to store the words, then sort the vector. That way you can count the words, in order, as you iterate over the vector.

expand/collapse icon Solution

This solution stores the words in a vector, sorts the vector, then retrieves each word from the vector and increments a count each time the same word is retrieved from the vector. When a word is retrieved from the vector that does not match the previous word, the count is reported for the last word, then reset for the new word.

Note that this program does not account for differences in case.

When executed, the program gives the following output:

Exercise 3-4

Write a program to report the length of the longest and shortest string in its input.

expand/collapse icon Hint

For each word that is input, keep track of the shortest and longest words you’ve encountered.

expand/collapse icon Solution

Create two variables of type string::size_type: one to hold the shortest size, and one to hold the longest size. As the program proceeds through the input loop it first checks to see if shortest_size has been assigned a value other than zero. If not, it assigns it the first string’s size; otherwise it checks to see if the current string is shortest than shortest_size. It then checks to see if the current string is longer than longest_size.

When executed the output looks like the following:

Exercise 3-5

Write a program that will keep track of grades for several students at once. The program could keep two vectors in sync. The first should hold the student’s names, and the second the final grades that can be computed as input is read. For now, you should assume a fixed number of homework grades. We’ll see in §4.1.3/56 how to handle a variable number of grades intermixed with student names.

expand/collapse icon Hint

As suggested, use two vectors. Define a maximum number of grades to request for each student. The grades vector will contain that number of grades multiplied by the number of students entered.

expand/collapse icon Solution

This solution uses two vectors as suggested in the exercise description. The first vector contains the names of each student, and the second vector contains five grades per student. The total number of grades requested for each student is controlled by a constant, max_grades.

This program expects the user to press the enter key after each grade to prevent the program from gathering too many grades. If the user enters more than max_grades numbers on one line, the program will function incorrectly. We will see in later chapters how to avoid this problem.

After the user presses the end-of-file key at the prompt for the next student name, the program will loop through the student_names and student_grades vectors to output each student’s grades and the average of those grades.

When executed the output is as follows:

Exercise 3-6

The average-grade computation in §3.1/36 might divide by zero if the student didn’t enter any grades. Division by zero is undefined in C++, which means that the implementation is permitted to do anything it likes. What does your C++ implementation do in this case? Rewrite the program so that its behavior does not depend on how the implementation treats division by zero.

expand/collapse icon Hint

Look in the code and see which variable could be used as the denominator in division. Test that value against zero prior to the calculation and take appropriate action.

expand/collapse icon Solution

On my Windows 7 64-bit system, running the exercise as compiled by Visual Studio 2010 Professional, pressing end-of-file when the first grades are requested generates the following output:

The odd-looking output “-1.#J” is the result of the division of sum by zero. Internally, this number is “-1.#IND” which means “indeterminate.” Other systems might output “nan” or “NaN,” meaning “not a number.”

To prevent division by zero, test the count variable before using it as the denominator in the division calculation and report the error to the user.

When executed and no grades are entered the output is as follows:

6 thoughts on “Chapter 3 – Working with Batches of Data

  1. Your source for 3-2 doesn’t calculate the quartiles properly when there’s an odd total number of values and an even number of values above and below the midpoint.

  2. omg, your 3.4 exercise answer is compact, simple and accurate.. :O

    my code used vector and kind of long for its simple task, thought so there just some additional to the program, and thnx for the writing titled “George Orwell and Effective Coding”, very inspirational…

    srry for bad grammar, by non-native english person ;)

  3. I think L’s right; there’s a problem with odd totals with even values above and below the mid-point. I had tremendous difficulty with the problem – currently teaching myself maths as well as C++ and really couldn’t figure out how to find the quartiles those particular odd groups. I think there may be a rounding issue with half_med.

    In the end, I resorted to filling two secondary vectors with one half of the original vector in each, and then working out the median of each of those.

    I have a feeling that is horrendously, horrendously inefficient in every way, but it is the only way I’ve been able to find that consistently gives the correct answers consistently.

  4. About 3.2. pheeeeew.
    First, the statistic value “quartile” has two approaches by which it is calculated. If we use the ISO approach, we never include the median value in the calculation of quartiles.

    Second: IMO this is where the mistake is:
    I suspect you dont take into account that the implementation rounds down. 5/2=2 for integers, I am not sure however, what it will do for unsigned vector::sizetype. I suspect it still rounds numbers down by default.

    // find the size of the lower and upper sets
    vec_sz half_size = size % 2 == 0 ? mid : (mid % 2 == 0 ? mid – 1 : mid); //<—– here , you expect the "mid" value to be rounded upwards

  5. Thanks for posting the sample answers!
    I think there’s some problems with 3-3 though.. it doesn’t give me the correct output.

Leave a Reply

Your email address will not be published. Required fields are marked *