最近写了写Python-技术圈

关注、星标下方公众号，和你一起成长

大家好，日拱一卒，我是宝器。

今天我们继续来聊伯克利的公开课CS61A，这一次是这门课的第二个大作业。

本项目会手把手带着你使用机器学习算法，对伯克利附近的餐馆根据用户的评价进行聚类，并在网页当中展示。

最后的效果大概是这样的：

虽然项目当中用到的机器学习和Python的知识都比较基础，但整个项目最后的呈现效果还是震撼到了我。这真是我见过最酷炫的Python作业了。

老师替我们完成了所有数据搜集和网页展示的部分，我们要实现的是其中的核心逻辑。

项目网址：https://inst.eecs.berkeley.edu/~cs61a/sp18/proj/maps/#problem-6-2-pt

项目采用的都是Yelp中的真实数据，Yelp可以理解成海外版的大众点评。图中的每一个点表示一家餐馆，颜色代表了类别，类别是需要我们通过机器学习的聚类算法去实现的。所以项目当中也会涉及一些基础的机器学习知识。

完整的项目代码可以点击最下方【阅读原文】跳转我的GitHub。

阶段0 工具方法

Problem 0

实现utils.py，即一些项目需要经常使用的工具函数。

Problem 0.1 Using list comprehensions

list comprehension是如下形式的语法：

[ for  in  if ]

比如我们想要生成10以内所有偶数的平方，我们可以这样操作：

>>> [x * x for x in range(10) if x % 2 == 0]
[0, 4, 16, 36, 64]

从已有的list当中根据filter_fn进行过滤，过滤之后调用map_fn进行映射之后，获得新的list。

需要保证代码只有一行，并且使用了list comprehension.

def map_and_filter(s, map_fn, filter_fn):
    """Returns a new list containing the results of calling map_fn on each
    element of sequence s for which filter_fn returns a true value.

    >>> square = lambda x: x * x
    >>> is_odd = lambda x: x % 2 == 1
    >>> map_and_filter([1, 2, 3, 4, 5], square, is_odd)
    [1, 9, 25]
    """
    # BEGIN Question 0
    return [map_fn(x) for x in s if filter_fn(x)]
    # END Question 0

Problem 0.2: Using `min`

min函数接收一个list，返回其中最小的元素。但它同样可以接收一个匿名函数key，用来自定义元素的排序。这个匿名函数key只能有一个输入，它会被list中的每一个元素调用，它返回的结果将会用来进行比较。

参考如下例子：

>>> min([-1, 0, 1]) # no key argument; smallest input
-1
>>> min([-1, 0, 1], key=lambda x: x*x) # input with the smallest square
0

我们需要实现key_of_min_value函数，它的输入是一个字典d。返回字典中value最小的key。

要求只能使用一行代码，并且使用min函数。

字典在min函数当中作用的元素都是key，所以我们实现一个匿名函数通过key查找value即可。

def key_of_min_value(d):
    """Returns the key in a dict d that corresponds to the minimum value of d.

    >>> letters = {'a': 6, 'b': 5, 'c': 4, 'd': 5}
    >>> min(letters)
    'a'
    >>> key_of_min_value(letters)
    'c'
    """
    # BEGIN Question 0
    return min(d, key=lambda x: d[x])
    # END Question 0

Problem 1

实现mean函数，输入是一个list，返回list的均值。要求使用assert保证序列不为空。

def mean(s):
    """Returns the arithmetic mean of a sequence of numbers s.

    >>> mean([-1, 3])
    1.0
    >>> mean([0, -3, 2, -1])
    -0.5
    """
    # BEGIN Question 1
    assert len(s) > 0, 'empty list'
    return sum(s) / len(s)
    # END Question 1

阶段1 数据抽象

Problem 2

实现abstractions.py文件中restaurant的构造和选择函数。

make_restaurant函数: 返回一个餐馆，它由5个字段组成:
name (a string)
location (a list containing latitude and longitude)
categories (a list of strings)
price (a number)
reviews (a list of review data abstractions created by make_review)
restaurant_name: 返回restaurant名称
restaurant_location: 返回restaurant位置
restaurant_categories: 返回restaurant类别
restaurant_price: 返回restaurant价格
restaurant_ratings: 返回restaurant评分

我们只需要仿照文件中的样例进行实现即可：

def make_restaurant(name, location, categories, price, reviews):
    """Return a restaurant data abstraction containing the name, location,
    categories, price, and reviews for that restaurant."""
    # BEGIN Question 2
    "*** YOUR CODE HERE ***"
    return [name, location, categories, price, reviews]
    # END Question 2

def restaurant_name(restaurant):
    """Return the name of the restaurant, which is a string."""
    # BEGIN Question 2
    "*** YOUR CODE HERE ***"
    return restaurant[0]
    # END Question 2

def restaurant_location(restaurant):
    """Return the location of the restaurant, which is a list containing
    latitude and longitude."""
    # BEGIN Question 2
    "*** YOUR CODE HERE ***"
    return restaurant[1]
    # END Question 2

def restaurant_categories(restaurant):
    """Return the categories of the restaurant, which is a list of strings."""
    # BEGIN Question 2
    "*** YOUR CODE HERE ***"
    return restaurant[2]
    # END Question 2

def restaurant_price(restaurant):
    """Return the price of the restaurant, which is a number."""
    # BEGIN Question 2
    "*** YOUR CODE HERE ***"
    return restaurant[3]
    # END Question 2

def restaurant_ratings(restaurant):
    """Return a list of ratings, which are numbers from 1 to 5, of the
    restaurant based on the reviews of the restaurant."""
    # BEGIN Question 2
    "*** YOUR CODE HERE ***"
    return [review_rating(r) for r in restaurant[4]]
    # END Question 2

实现结束之后，你可以运行命令python3 recommend.py -u来查看指定用户评分过的店。如运行如下命令之后，打开网页内容为：

python3 recommend.py -u one_cluster

阶段2 无监督学习

在这个阶段我们需要使用无监督学习中的聚类算法，将比较接近的餐馆聚成一类。

我们使用的是k-means算法，它是一个经典的无监督聚类算法。它可以将一堆数据按照距离，聚集成k个类别。它的算法原理非常简单，初始化时会从数据当中随意选择k个点作为类簇（类别中心）。然后重复执行两个步骤，直到类簇不再发生变化：

根据距离类簇距离的远近，将样本点分成k个类别
将k个类别中的点的坐标取平均，得到新的类簇

在实现算法的过程当中，可能会遇到一些术语，这里做出解释：

location: 餐厅的坐标，可以表示成经纬度的集合：(latitude, longitude)
centroid: 某个类别的中心坐标（所有点的坐标均值）
restaurant: 餐厅信息的抽象，定义在 abstractions.py 文件中
cluster: 聚集在同一个类别的餐厅list
user: 用户信息的抽象，定义在 abstractions.py文件中
review: 评分信息的抽象, 定义在 abstractions.py文件中
feature function: 特征函数，单个参数函数。以餐厅为输入，返回一个浮点值，比如打分的均值或者是价格的均值

Problem 3

实现find_closest函数，输入一个坐标location和centroid集合（类簇坐标），返回距离location最近的centroid。

需要使用utils.py中的distance函数来计算两个location的距离，如果两个centroid等距离，返回序号较小的那个。

# distance函数
def distance(pos1, pos2):
    """Returns the Euclidean distance between pos1 and pos2, which are pairs.

    >>> distance([1, 2], [4, 6])
    5.0
    """
    return sqrt((pos1[0] - pos2[0]) ** 2 + (pos1[1] - pos2[1]) ** 2)

仿照刚刚min函数的使用方法，我们可以只需要一行代码就可以搞定：

def find_closest(location, centroids):
    """Return the centroid in centroids that is closest to location.
    If multiple centroids are equally close, return the first one.

    >>> find_closest([3.0, 4.0], [[0.0, 0.0], [2.0, 3.0], [4.0, 3.0], [5.0, 5.0]])
    [2.0, 3.0]
    """
    # BEGIN Question 3
    "*** YOUR CODE HERE ***"
    return min(centroids, key=lambda x: distance(location, x))
    # END Question 3

Problem 4

实现group_by_centroid函数，输入是restaurant的list和centroids的list，返回cluster的数组。

每一个cluster是一个restaurant的list，这些restaurant有一个共同点，它们距离某一个特定的centroid比其他的centroid更近。可以无视返回clusters中的顺序。

如果一家restaurant距离多个centroid相同距离，选择序号最小的那个。

提示：使用提供的group_by_first函数实现

group_by_first函数代码如下：

def group_by_first(pairs):
    """Return a list of pairs that relates each unique key in the [key, value]
    pairs to a list of all values that appear paired with that key.

    Arguments:
    pairs -- a sequence of pairs

    >>> example = [ [1, 2], [3, 2], [2, 4], [1, 3], [3, 1], [1, 2] ]
    >>> group_by_first(example)
    [[2, 3, 2], [2, 1], [4]]
    """
    keys = []
    for key, _ in pairs:
        if key not in keys:
            keys.append(key)
    return [[y for x, y in pairs if x == key] for key in keys]

这段函数的输入是一个pair的list，pair是一个的组合。首先提取出pairs当中所有的key，然后再根据key将pair进行分组。

我们的目的是将restaurant根据距离最近的centroid进行分类，有了group_by_first函数之后，我们可以生成[[restaurant, centroid]]形式的数据，调用group_by_first完成目标。

def group_by_centroid(restaurants, centroids):
    """Return a list of clusters, where each cluster contains all restaurants
    nearest to a corresponding centroid in centroids. Each item in
    restaurants should appear once in the result, along with the other
    restaurants closest to the same centroid.
    """
    # BEGIN Question 4
    "*** YOUR CODE HERE ***"
    pairs = [[find_closest(restaurant_location(r), centroids), r] for r in restaurants]
    return group_by_first(pairs)
    # END Question 4

Problem 5

实现find_centroid函数，根据cluster中点的坐标均值计算中心点。

提示：使用utils中的mean函数实现逻辑

def find_centroid(cluster):
    """Return the centroid of the locations of the restaurants in cluster."""
    # BEGIN Question 5
    "*** YOUR CODE HERE ***"
    latitudes, longitudes = [], []
    for c in cluster:
        loc = restaurant_location(c)
        latitudes.append(loc[0])
        longitudes.append(loc[1])
    return [mean(latitudes), mean(longitudes)]
    # END Question 5

Problem 6

实现k_means函数完成kmeans算法，代码中已经实现了算法中的第一个步骤。follow接下来的步骤完成while语句：

将restaurant聚类，每一个类簇中的restaurant最接近的centroid一样
根据聚类的结果，更新centroids

提示：可以使用group_by_centroid和find_centroid函数

其实只要理解了group_by_centroid和find_centroid函数的逻辑，以及kmeans的算法原理之后，不难实现。代码如下：

def k_means(restaurants, k, max_updates=100):
    """Use k-means to group restaurants by location into k clusters."""
    assert len(restaurants) >= k, 'Not enough restaurants to cluster'
    old_centroids, n = [], 0
    # Select initial centroids randomly by choosing k different restaurants
    centroids = [restaurant_location(r) for r in sample(restaurants, k)]

    while old_centroids != centroids and n < max_updates:
        old_centroids = centroids
        # BEGIN Question 6
        "*** YOUR CODE HERE ***"
        pairs = group_by_centroid(restaurants, centroids)
        centroids = [find_centroid(l) for l in pairs]
        # END Question 6
        n += 1
    return centroids

我们可以使用如下命令测试：

python3 recommend.py -k 2

结果如下：

阶段3 有监督学习

在这个阶段，我们要预测用户对于一个餐厅的打分。我们需要使用机器学习中的有监督学习算法，来预测用户对于一家没有去过的餐厅的打分。简单来说，就是从已有的用户打分的数据当中训练一个模型，让模型能够根据历史数据做出尽可能精准的预测。

当完成这个阶段之后，你的可视化结果将会包含所有的餐厅，而非只有用户有过打分的餐厅。

要预测打分，需要使用最小二乘法训练线性回归模型，这是机器学习领域一个常用的回归模型。它可以学习一些特征和待预测结果之间的线性关联，从而做出预测。

算法读入一系列的输入和输出的pair，返回一个斜线，使得它距离所有样本点的距离的平方和最小。

Problem 7

实现find_predictor函数，输入一个user，一系列restaurant和一个特征提取函数feature_fn。

函数返回两个值：predictor和一个r_squared值。使用最小二乘法来计算predictor和r_squared。这个方法描述如下：

计算precitor中的系数a和b，它表示一个线性函数：y = a + bx。r_squared衡量模型和原始数据的精确度。

我们可以使用如下的公式计算三个值：：

Sxx = Σi (xi - mean(x))2
Syy = Σi (yi - mean(y))2
Sxy = Σi (xi - mean(x) (yi - mean(y)

计算出这三个值之后，我们可以代入如下公式计算出a和b以及r_squared：

b = Sxy / Sxx
a = mean(y) - b * mean(x)
R2 = Sxy2 / (Sxx Syy)

提示：可以使用mean和zip函数

其实算法的实现没有难度，公式已经给我们写好了，我们只需要按照公式实现即可。

比较麻烦的其实是这个公式的推导，需要基于最小二乘法和线性回归模型，大家感兴趣可以去补一下这部分内容。

def find_predictor(user, restaurants, feature_fn):
    """Return a rating predictor (a function from restaurants to ratings),
    for a user by performing least-squares linear regression using feature_fn
    on the items in restaurants. Also, return the R^2 value of this model.

    Arguments:
    user -- A user
    restaurants -- A sequence of restaurants
    feature_fn -- A function that takes a restaurant and returns a number
    """
    reviews_by_user = {review_restaurant_name(review): review_rating(review)
                       for review in user_reviews(user).values()}

    xs = [feature_fn(r) for r in restaurants]
    ys = [reviews_by_user[restaurant_name(r)] for r in restaurants]

    # BEGIN Question 7
    x_mean = mean(xs)
    sxx = sum([(x - x_mean)**2 for x in xs])

    y_mean = mean(ys)
    syy = sum([(y - y_mean)**2 for y in ys])
    sxy = sum([(x - x_mean)*(y - y_mean) for x, y in zip(xs, ys)])
    b = sxy / sxx  # REPLACE THIS LINE WITH YOUR SOLUTION
    a = y_mean - b * x_mean
    r_squared = sxy * sxy / (sxx * syy)
    # END Question 7

    def predictor(restaurant):
        return b * feature_fn(restaurant) + a

    return predictor, r_squared

Problem 8

实现best_predictor函数，输入user和restaurant的list，以及feature_fns的list。

它使用每一个特征函数来计算predictor，并且返回r_squared最高的predictor。

所有的predictor都是基于用户有过评论的餐厅列表学习的，所以我们需要使用reviewed函数来进行过滤，这个函数在之前已经实现。

提示：max函数也可以接收一个key参数，和min一样

使用我们刚才开发的find_predictor函数，通过返回的r_squared找出最大的predictor即可。

def best_predictor(user, restaurants, feature_fns):
    """Find the feature within feature_fns that gives the highest R^2 value
    for predicting ratings by the user; return a predictor using that feature.

    Arguments:
    user -- A user
    restaurants -- A list of restaurants
    feature_fns -- A sequence of functions that each takes a restaurant
    """
    reviewed = user_reviewed_restaurants(user, restaurants)
    # BEGIN Question 8
    "*** YOUR CODE HERE ***"
    dt = {}
    for feature_fn in feature_fns:
        predictor, r_squared = find_predictor(user, reviewed, feature_fn)
        dt[predictor] = r_squared
    return max(dt, key=lambda x: dt[x])
    # END Question 8

Problem 9

实现rate_all函数，入参为一个user和一个restaurant的list，以及feature_fns的list。它返回一个dictionary，它的key是restaurant的名称，它的value是评分（rating）。

如果restaurant之前用户评分过，直接使用历史评分，否则使用predictor进行预测。predictor已经使用feature_fns获得。

提示：你可以使用abstractions.py中的user_rating函数

逻辑并不复杂，我们枚举一下restaurant，判断它是否出现在了用户评论的列表中，如果没有则使用predictor进行预测，否则使用user_rating函数获取评分。

def rate_all(user, restaurants, feature_fns):
    """Return the predicted ratings of restaurants by user using the best
    predictor based on a function from feature_fns.

    Arguments:
    user -- A user
    restaurants -- A list of restaurants
    feature_fns -- A sequence of feature functions
    """
    predictor = best_predictor(user, ALL_RESTAURANTS, feature_fns)
    reviewed = user_reviewed_restaurants(user, restaurants)
    # BEGIN Question 9
    "*** YOUR CODE HERE ***"
    return {restaurant_name(r): user_rating(user, restaurant_name(r)) if r in reviewed else predictor(r) for r in restaurants}
    # END Question 9

实现之后，我们可以使用如下命令测试：

python3 recommend.py -u likes_southside -k 5 -p

结果如下：

Problem 10

实现search函数，它的输入是一个query，它是string类型，表示一个类别。返回所有拥有这个类别的restaurant。

提示：我们可以使用list comprehension

每一个restaurant都有categories属性，是若干category的list。我们只需要判断category是否在categories当中即可。

def search(query, restaurants):
    """Return each restaurant in restaurants that has query as a category.

    Arguments:
    query -- A string
    restaurants -- A sequence of restaurants
    """
    # BEGIN Question 10
    "*** YOUR CODE HERE ***"
    return [r for r in restaurants if query in restaurant_categories(r)]
    # END Question 10

实现完成之后，可以使用如下命令测试：

python3 recommend.py -u likes_expensive -k 2 -p -q Sandwiches

结果如下：

如果你愿意，还可以自行在users文件夹当中创建属于你的评分数据来进行演示。

只不过由于我们并不在伯克利，对于伯克利附近的餐厅就更一无所知了，所以也就无从说起……

虽然这个项目看起来很大很复杂，但实际上我整个编码只用了一两个小时，整体的体验非常好。虽然ui页面不是我们写的，但是调试的时候，看到酷炫的结果，还是非常非常有成就感。而且整体的代码质量很不错，非常值得学习。

·················END·················

最近写了写Python

阶段0 工具方法

Problem 0

Problem 0.1 Using list comprehensions

Problem 0.2: Using min

Problem 1

阶段1 数据抽象

Problem 2

阶段2 无监督学习

Problem 3

Problem 4

Problem 5

Problem 6

阶段3 有监督学习

Problem 7

Problem 8

Problem 9

Problem 10

推荐阅读

Problem 0.2: Using `min`