John Bullinaria's Step by Step Guide to Implementing a Neural Network in C
By John A. Bullinaria from the School of Computer Science of The University of Birmingham, UK.

This document contains a step by step guide to implementing a simple neural network in C. It is aimed mainly at students who wish to (or have been told to) incorporate a neural network learning component into a larger system they are building. Obviously there are many types of neural network one could consider using - here I shall concentrate on one particularly common and useful type, namely a simple three-layer back-propagation network (multi-layer perceptron). This type of network will be useful when we have a set of input vectors and a corresponding set of output vectors, and our system must produce an appropriate output for each input it is given. Of course, if we already have a complete noise-free set of input and output vectors, then a simple look-up table would suffice. However, if we want the system to generalize, i.e. produce appropriate outputs for inputs it has never seen before, then a neural network that has learned how to map between the known inputs and outputs (i.e. our training set) will often do a pretty good job for new inputs as well. I shall assume that the reader is already familiar with C, and for more details about neural networks in general I shall simply refer the reader to the relevant internet newsgroups and the associated Neural Networks FAQ. So, let us begin.

A single neuron (i.e. processing unit) takes a total input In and produces an output activation Out. I shall take this to be the sigmoid function

    Out = 1.0/(1.0 + exp(-In)) ;    /* Out = Sigmoid(In) */

though other activation functions are often used (e.g. linear or hyperbolic tangent). This has the effect of squashing the infinite range of In into the range 0 to 1. It also has the convenient property that its derivative takes the particularly simple form

    SigmoidDerivative = Sigmoid * (1.0 - Sigmoid) ;

Typically, the input In into a given neuron will be the weighted sum of the output activations feeding in from a number of other neurons. It is convenient to think of the activations flowing through layers of neurons. So, if there are NumUnits1 neurons in layer 1, the total activation flowing into our layer 2 neuron is just the sum over Layer1Out[i]*Weight[i], where Weight[i] is the strength/weight of the connection between unit i in layer 1 and our unit in layer 2. Each neuron will also have a bias, or resting state, that is added to the sum of inputs, and it is convenient to call this Weight[0]. We can then write

    Layer2In = Weight[0] ;                       /* start with the bias */
    for( i = 1 ; i <= NumUnits1 ; i++ ) {        /* add in weighted contributions from layer 1 */
        Layer2In += Layer1Out[i] * Weight[i] ;
    }
    Layer2Out = 1.0/(1.0 + exp(-Layer2In)) ;     /* compute sigmoid to give activation */

Normally layer 2 will have many units too, so it is appropriate to write the weights between unit i in layer 1 and unit j in layer 2 as an array Weight[i][j].
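That derivative property can be checked directly from the definition of the sigmoid:

\begin{eqnarray}
  \sigma(x) = \frac{1}{1+e^{-x}}
  \quad\Rightarrow\quad
  \sigma'(x) = \frac{e^{-x}}{(1+e^{-x})^2} = \sigma(x)\,(1-\sigma(x))
\end{eqnarray}

This is what allows the learning code later on to compute all the derivatives it needs from the activations alone.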
So, to get the output of unit j in layer 2, we have

    Layer2In[j] = Weight[0][j] ;
    for( i = 1 ; i <= NumUnits1 ; i++ ) {
        Layer2In[j] += Layer1Out[i] * Weight[i][j] ;
    }
    Layer2Out[j] = 1.0/(1.0 + exp(-Layer2In[j])) ;

Remember that in C the array indices start from zero, not one, so we would declare our variables as

    double Layer1Out[NumUnits1+1] ;
    double Layer2In[NumUnits2+1] ;
    double Layer2Out[NumUnits2+1] ;
    double Weight[NumUnits1+1][NumUnits2+1] ;

(or, more likely, declare pointers and use calloc or malloc to allocate the memory). Naturally, we need another loop to get all the layer 2 outputs

    for( j = 1 ; j <= NumUnits2 ; j++ ) {
        Layer2In[j] = Weight[0][j] ;
        for( i = 1 ; i <= NumUnits1 ; i++ ) {
            Layer2In[j] += Layer1Out[i] * Weight[i][j] ;
        }
        Layer2Out[j] = 1.0/(1.0 + exp(-Layer2In[j])) ;
    }

Three-layer networks are necessary and sufficient for most purposes, so our layer 2 outputs feed into a third layer in the same way as above. The code can start to get confusing at this point - I find that keeping a separate index i, j, k for each layer helps, as does an intuitive notation for distinguishing between the different layers of weights, Weight12 and Weight23. For obvious reasons, for three-layer networks it is traditional to call layer 1 the Input layer, layer 2 the Hidden layer, and layer 3 the Output layer. Our network thus takes on the familiar form that we shall use for the rest of this document. Also, to save getting all the In's and Out's confused, we can write LayerNIn as SumN. Our code can thus be written

    for( j = 1 ; j <= NumHidden ; j++ ) {        /* compute hidden unit activations */
        SumH[j] = WeightIH[0][j] ;
        for( i = 1 ; i <= NumInput ; i++ ) {
            SumH[j] += Input[i] * WeightIH[i][j] ;
        }
        Hidden[j] = 1.0/(1.0 + exp(-SumH[j])) ;
    }
    for( k = 1 ; k <= NumOutput ; k++ ) {        /* compute output unit activations */
        SumO[k] = WeightHO[0][k] ;
        for( j = 1 ; j <= NumHidden ; j++ ) {
            SumO[k] += Hidden[j] * WeightHO[j][k] ;
        }
        Output[k] = 1.0/(1.0 + exp(-SumO[k])) ;
    }

Generally, we will have a whole set of NumPattern training patterns, i.e. pairs of input and target output vectors, labelled by the index p. The network learns by minimizing some measure of the error of the network's actual outputs compared with the target outputs. For example, the sum squared error over all output units k and all training patterns p will be given by

    Error = 0.0 ;
    for( p = 1 ; p <= NumPattern ; p++ ) {
        for( k = 1 ; k <= NumOutput ; k++ ) {
            Error += 0.5 * (Target[p][k] - Output[p][k]) * (Target[p][k] - Output[p][k]) ;
        }
    }

(The factor of 0.5 is conventionally included to simplify the algebra in deriving the learning algorithm.) If we insert the above code for computing the network outputs into the p loop of this, we end up with

    Error = 0.0 ;
    for( p = 1 ; p <= NumPattern ; p++ ) {       /* repeat for all training patterns */
        for( j = 1 ; j <= NumHidden ; j++ ) {    /* compute hidden unit activations */
            SumH[p][j] = WeightIH[0][j] ;
            for( i = 1 ; i <= NumInput ; i++ ) {
                SumH[p][j] += Input[p][i] * WeightIH[i][j] ;
            }
            Hidden[p][j] = 1.0/(1.0 + exp(-SumH[p][j])) ;
        }
        for( k = 1 ; k <= NumOutput ; k++ ) {    /* compute output unit activations and errors */
            SumO[p][k] = WeightHO[0][k] ;
            for( j = 1 ; j <= NumHidden ; j++ ) {
                SumO[p][k] += Hidden[p][j] * WeightHO[j][k] ;
            }
            Output[p][k] = 1.0/(1.0 + exp(-SumO[p][k])) ;
            Error += 0.5 * (Target[p][k] - Output[p][k]) * (Target[p][k] - Output[p][k]) ;
        }
    }

I'll leave the reader to dispense with any indices that they don't need for the purposes of their own system (e.g. the indices on SumH and SumO). The next stage is to iteratively adjust the weights to minimize the network's error. A standard way to do this is by gradient descent on the error function. We can compute how much the error is changed by a small change in each weight (i.e. compute the partial derivatives dError/dWeight) and shift the weights by a small amount in the direction that reduces the error. The literature is full of variations on this general approach - I shall begin with the standard on-line back-propagation with momentum algorithm.
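In symbols, each weight w is nudged by a gradient term scaled by a learning rate eta, plus a momentum term scaled by alpha (both parameters are discussed below):

\begin{eqnarray}
  \Delta w(t) = -\eta\, \frac{\partial\,\mathrm{Error}}{\partial w} + \alpha\, \Delta w(t-1),
  \qquad
  w \leftarrow w + \Delta w(t)
\end{eqnarray}

This pair of updates is exactly what the DeltaWeightIH and DeltaWeightHO arrays implement in the code that follows.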
This is not the place to go through all the mathematics, but for the above sum squared error we can compute and apply one iteration (or epoch) of the required weight changes DeltaWeightIH and DeltaWeightHO using

    Error = 0.0 ;
    for( p = 1 ; p <= NumPattern ; p++ ) {
        for( j = 1 ; j <= NumHidden ; j++ ) {    /* compute hidden unit activations */
            SumH[p][j] = WeightIH[0][j] ;
            for( i = 1 ; i <= NumInput ; i++ ) {
                SumH[p][j] += Input[p][i] * WeightIH[i][j] ;
            }
            Hidden[p][j] = 1.0/(1.0 + exp(-SumH[p][j])) ;
        }
        for( k = 1 ; k <= NumOutput ; k++ ) {    /* compute output unit activations and errors */
            SumO[p][k] = WeightHO[0][k] ;
            for( j = 1 ; j <= NumHidden ; j++ ) {
                SumO[p][k] += Hidden[p][j] * WeightHO[j][k] ;
            }
            Output[p][k] = 1.0/(1.0 + exp(-SumO[p][k])) ;
            Error += 0.5 * (Target[p][k] - Output[p][k]) * (Target[p][k] - Output[p][k]) ;
            DeltaO[k] = (Target[p][k] - Output[p][k]) * Output[p][k] * (1.0 - Output[p][k]) ;
        }
        for( j = 1 ; j <= NumHidden ; j++ ) {    /* back-propagate errors to hidden layer */
            SumDOW[j] = 0.0 ;
            for( k = 1 ; k <= NumOutput ; k++ ) {
                SumDOW[j] += WeightHO[j][k] * DeltaO[k] ;
            }
            DeltaH[j] = SumDOW[j] * Hidden[p][j] * (1.0 - Hidden[p][j]) ;
        }
        for( j = 1 ; j <= NumHidden ; j++ ) {    /* update weights WeightIH */
            DeltaWeightIH[0][j] = eta * DeltaH[j] + alpha * DeltaWeightIH[0][j] ;
            WeightIH[0][j] += DeltaWeightIH[0][j] ;
            for( i = 1 ; i <= NumInput ; i++ ) {
                DeltaWeightIH[i][j] = eta * Input[p][i] * DeltaH[j] + alpha * DeltaWeightIH[i][j] ;
                WeightIH[i][j] += DeltaWeightIH[i][j] ;
            }
        }
        for( k = 1 ; k <= NumOutput ; k++ ) {    /* update weights WeightHO */
            DeltaWeightHO[0][k] = eta * DeltaO[k] + alpha * DeltaWeightHO[0][k] ;
            WeightHO[0][k] += DeltaWeightHO[0][k] ;
            for( j = 1 ; j <= NumHidden ; j++ ) {
                DeltaWeightHO[j][k] = eta * Hidden[p][j] * DeltaO[k] + alpha * DeltaWeightHO[j][k] ;
                WeightHO[j][k] += DeltaWeightHO[j][k] ;
            }
        }
    }

(There is clearly plenty of scope for re-ordering, combining and simplifying the loops here - I will leave that for the reader to do once they have understood what the separate code sections are doing.) The weight changes DeltaWeightIH and DeltaWeightHO are each made up of two components. First, the eta component, which is the gradient descent contribution. Second, the alpha component, which is a momentum term that effectively keeps a moving average of the gradient descent weight change contributions, and thus smooths out the overall weight changes. Fixing good values for the learning parameters eta and alpha is usually a matter of trial and error. Certainly alpha must be in the range 0 to 1, and a non-zero value does usually speed up learning. Finding a good value for eta will depend on the problem, and also on the value chosen for alpha. If it is set too low, the training will be unnecessarily slow. Having it too large will cause the weight changes to oscillate wildly, and can slow down or even prevent learning altogether. (I generally start by trying eta = 0.1 and explore the effects of repeatedly doubling or halving it.) The complete training process will consist of repeating the above weight updates for a number of epochs (using another for loop) until some error criterion is met, for example the Error falls below some chosen small number. (Note that, with sigmoids on the outputs, the Error can only reach exactly zero if the weights reach infinity. Note also that sometimes the training can get stuck in a local minimum of the error function and never get anywhere near the actual minimum.) So, we need to wrap the last block of code in something like

    for( epoch = 1 ; epoch < LARGENUMBER ; epoch++ ) {
        /* ABOVE CODE FOR ONE ITERATION */
        if( Error < SMALLNUMBER ) break ;
    }

If the training patterns are presented in the same systematic order during each epoch, it is possible for weight oscillations to occur. It is therefore generally a good idea to use a new random order for the training patterns for each epoch.
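For reference, the DeltaO and DeltaH quantities computed in the iteration above are the standard back-propagation error signals for sigmoid units trained on the sum squared error; in equation form:

\begin{eqnarray}
  \delta^{O}_{k} = (t_k - o_k)\, o_k\, (1 - o_k),
  \qquad
  \delta^{H}_{j} = h_j\, (1 - h_j) \sum_k W^{HO}_{jk}\, \delta^{O}_{k}
\end{eqnarray}

where t_k, o_k and h_j stand for Target[p][k], Output[p][k] and Hidden[p][j] respectively.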
If we put the NumPattern training pattern indices p in random order into an array ranpat[], then it is simply a matter of replacing our training pattern loop

    for( p = 1 ; p <= NumPattern ; p++ ) {

with

    for( np = 1 ; np <= NumPattern ; np++ ) {
        p = ranpat[np] ;

Generating the random array ranpat[] is not quite so simple, but the following code will do the job

    for( p = 1 ; p <= NumPattern ; p++ ) {    /* set up an ordered array */
        ranpat[p] = p ;
    }
    for( p = 1 ; p <= NumPattern ; p++ ) {    /* swap random elements into each position */
        np = p + rando() * ( NumPattern + 1 - p ) ;
        op = ranpat[p] ; ranpat[p] = ranpat[np] ; ranpat[np] = op ;
    }

Naturally, one must set some initial network weights to start the learning process. Starting all the weights at zero is generally not a good idea, as that is often a local minimum of the error function. It is normal to initialize all the weights with small random values. If rando() is your favourite random number generator function that returns a flat distribution of random numbers in the range 0 to 1, and smallwt is the maximum absolute size of your initial weights, then an appropriate section of weight initialization code would be

    for( j = 1 ; j <= NumHidden ; j++ ) {        /* initialize WeightIH and DeltaWeightIH */
        for( i = 0 ; i <= NumInput ; i++ ) {
            DeltaWeightIH[i][j] = 0.0 ;
            WeightIH[i][j] = 2.0 * ( rando() - 0.5 ) * smallwt ;
        }
    }
    for( k = 1 ; k <= NumOutput ; k++ ) {        /* initialize WeightHO and DeltaWeightHO */
        for( j = 0 ; j <= NumHidden ; j++ ) {
            DeltaWeightHO[j][k] = 0.0 ;
            WeightHO[j][k] = 2.0 * ( rando() - 0.5 ) * smallwt ;
        }
    }

Note that it is a good idea to set all the initial DeltaWeights to zero at the same time. We now have enough code to put together a working neural network program. I have cut and pasted the above code into the file nn.c (which your browser should allow you to save into your own file space). I have added the standard #includes, declared all the variables, hard coded the standard XOR training data and values for eta, alpha and smallwt, defined a simple rando(), added some print statements to show what the network is doing, and wrapped the whole lot in a main(). The file should compile and run in the normal way (e.g. using the UNIX commands cc nn.c -O -lm -o nn and nn). I have left plenty for the reader to do to convert this into a useful program, for example:

Read the training data from file
Allow the parameters (eta, alpha, smallwt, NumHidden, etc.) to be varied during run time
Determine suitable array sizes, and allocate the memory during run time (see the sketch after these lists)
Save the weights to file, and read them back in again
Plot the errors, output activations, etc. during training

There are also numerous network variations that could be implemented, for example:

Batch learning, rather than on-line learning
Alternative activation functions (linear, tanh, etc.)
Real- (rather than binary-) valued outputs require linear output functions
    Output[p][k] = SumO[p][k] ;
    DeltaO[k] = Target[p][k] - Output[p][k] ;
Cross-Entropy cost function rather than Sum Squared Error
    Error -= ( Target[p][k] * log( Output[p][k] ) + ( 1.0 - Target[p][k] ) * log( 1.0 - Output[p][k] ) ) ;
    DeltaO[k] = Target[p][k] - Output[p][k] ;
Separate training, validation and testing sets
Weight decay / regularization
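As an illustration of the "allocate the memory during run time" suggestion above, here is one minimal way it might be done with calloc; the helper function is hypothetical, not part of nn.c:

    #include <stdlib.h>

    /* Allocate a (rows+1) x (cols+1) weight matrix at run time, zero-initialized.
       A hypothetical helper, shown only as a sketch.                             */
    double **alloc_matrix( int rows, int cols ) {
        double **m = malloc( (rows+1) * sizeof(double *) ) ;
        for( int r = 0 ; r <= rows ; r++ ) {
            m[r] = calloc( cols+1, sizeof(double) ) ;
        }
        return m ;
    }

    /* usage: double **WeightIH = alloc_matrix( NumInput, NumHidden ) ; */

A real program would also check the malloc/calloc return values and free the memory when done.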
But from here, you are on your own. I hope you have found this page useful. This page is maintained by John Bullinaria. Last updated on 18 November 2002.

Using neural networks to recognize handwritten digits

The human visual system is one of the wonders of the world. Consider the following sequence of handwritten digits: Most people effortlessly recognize those digits as 504192. That ease is deceptive. In each hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing 140 million neurons, with tens of billions of connections between them. And yet human vision involves not just V1, but an entire series of visual cortices - V2, V3, V4, and V5 - doing progressively more complex image processing. We carry in our heads a supercomputer, tuned by evolution over hundreds of millions of years, and superbly adapted to understand the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are stupendously, astoundingly good at making sense of what our eyes show us. But nearly all that work is done unconsciously. And so we don't usually appreciate how tough a problem our visual systems solve. The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above. What seems easy when we do it ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize shapes - "a 9 has a loop at the top, and a vertical stroke in the bottom right" - turn out to be not so simple to express algorithmically. When you try to make such rules precise, you quickly get lost in a morass of exceptions and caveats and special cases. It seems hopeless. Neural networks approach the problem in a different way. The idea is to take a large number of handwritten digits, known as training examples, and then develop a system which can learn from those training examples.
In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. Furthermore, by increasing the number of training examples, the network can learn more about handwriting, and so improve its accuracy. So while I've shown just 100 training digits above, perhaps we could build a better handwriting recognizer by using thousands or even millions or billions of training examples. In this chapter we'll write a computer program implementing a neural network that learns to recognize handwritten digits. The program is just 74 lines long, and uses no special neural network libraries. But this short program can recognize digits with an accuracy of over 96 percent, without human intervention. Furthermore, in later chapters we'll develop ideas which can improve the accuracy to over 99 percent. In fact, the best commercial neural networks are now so good that they are used by banks to process cheques, and by post offices to recognize addresses. We're focusing on handwriting recognition because it's an excellent prototype problem for learning about neural networks in general. As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains. Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining why things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in position to understand what deep learning is, and why it matters.

What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons. So how do perceptrons work?
A perceptron takes several binary inputs, x1, x2, ..., and produces a single binary output: In the example shown the perceptron has three inputs, x1, x2, x3. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, w1, w2, ..., real numbers expressing the importance of the respective inputs to the output. The neuron's output, 0 or 1, is determined by whether the weighted sum \sum_j w_j x_j is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms:

\begin{eqnarray}
  \mbox{output} = \left\{
    \begin{array}{ll}
      0 & \mbox{if } \sum_j w_j x_j \leq \mbox{threshold} \\
      1 & \mbox{if } \sum_j w_j x_j > \mbox{threshold}
    \end{array}
  \right.
\tag{1}\end{eqnarray}

That's all there is to how a perceptron works! That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors: Is the weather good? Does your boyfriend or girlfriend want to accompany you? Is the festival near public transit? (You don't own a car.) We can represent these three factors by corresponding binary variables x1, x2, and x3. For instance, we'd have x1 = 1 if the weather is good, and x1 = 0 if the weather is bad. Similarly, x2 = 1 if your boyfriend or girlfriend wants to go, and x2 = 0 if not. And similarly again for x3 and public transit. Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight w1 = 6 for the weather, and w2 = 2 and w3 = 2 for the other conditions. The larger value of w1 indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of 5 for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting 1 whenever the weather is good, and 0 whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby. By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of 3. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival. Obviously, the perceptron isn't a complete model of human decision-making! But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions.
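As a small aside, the festival perceptron can be coded up directly; this C sketch just uses the example's weights (6, 2, 2) and threshold 5:

    #include <stdio.h>

    /* Perceptron rule (1): output 1 if the weighted sum exceeds the threshold */
    int perceptron_output( const double w[], const int x[], int n, double threshold ) {
        double sum = 0.0 ;
        for( int j = 0 ; j < n ; j++ ) sum += w[j] * x[j] ;
        return ( sum > threshold ) ? 1 : 0 ;
    }

    int main( void ) {
        double w[3] = { 6.0, 2.0, 2.0 } ;   /* weather, companion, transit */
        int    x[3] = { 1, 0, 0 } ;         /* good weather, no companion, no transit */
        printf( "go to the festival? %d\n", perceptron_output( w, x, 3, 5.0 ) ) ; /* prints 1 */
        return 0 ;
    }

With these weights the weather alone (6 > 5) decides the output, matching the behaviour described above.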
And it should seem plausible that a complex network of perceptrons could make quite subtle decisions: In this network, the first column of perceptrons - what we'll call the first layer of perceptrons - is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making. In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptrons in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision making. Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits. Let's simplify the way we describe perceptrons. The condition \sum_j w_j x_j > \mbox{threshold} is cumbersome, and we can make two notational changes to simplify it. The first change is to write \sum_j w_j x_j as a dot product, w \cdot x \equiv \sum_j w_j x_j, where w and x are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, b \equiv -\mbox{threshold}. Using the bias instead of the threshold, the perceptron rule can be rewritten:

\begin{eqnarray}
  \mbox{output} = \left\{
    \begin{array}{ll}
      0 & \mbox{if } w \cdot x + b \leq 0 \\
      1 & \mbox{if } w \cdot x + b > 0
    \end{array}
  \right.
\tag{2}\end{eqnarray}

You can think of the bias as a measure of how easy it is to get the perceptron to output a 1. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a 1. But if the bias is very negative, then it's difficult for the perceptron to output a 1. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias. I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight -2, and an overall bias of 3. Here's our perceptron: Then we see that input 00 produces output 1, since (-2)*0 + (-2)*0 + 3 = 3 is positive. Here, I've introduced the * symbol to make the multiplications explicit. Similar calculations show that the inputs 01 and 10 produce output 1. But the input 11 produces output 0, since (-2)*1 + (-2)*1 + 3 = -1 is negative. And so our perceptron implements a NAND gate! The NAND example shows that we can use perceptrons to compute simple logical functions. In fact, we can use networks of perceptrons to compute any logical function at all.
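The claim that this perceptron computes NAND can be checked exhaustively with a few lines of C:

    #include <stdio.h>

    /* NAND as a perceptron: weights -2, -2 and bias 3, using rule (2) */
    int nand_perceptron( int x1, int x2 ) {
        double z = -2.0 * x1 + -2.0 * x2 + 3.0 ;
        return ( z > 0.0 ) ? 1 : 0 ;
    }

    int main( void ) {
        for( int x1 = 0 ; x1 <= 1 ; x1++ )
            for( int x2 = 0 ; x2 <= 1 ; x2++ )
                printf( "%d %d -> %d\n", x1, x2, nand_perceptron( x1, x2 ) ) ;
        /* prints the NAND truth table: 0 0 -> 1, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0 */
        return 0 ;
    }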
The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, x1 and x2. This requires computing the bitwise sum, x1 \oplus x2, as well as a carry bit which is set to 1 when both x1 and x2 are 1, i.e., the carry bit is just the bitwise product x1 x2: To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight -2, and an overall bias of 3. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram: One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron. When I defined the perceptron model I didn't say whether this kind of double-output-to-the-same-place was allowed. Actually, it doesn't much matter. If we don't want to allow this kind of thing, then it's possible to simply merge the two lines into a single connection with a weight of -4 instead of two connections with -2 weights. (If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to -2, all biases equal to 3, and a single weight of -4, as marked: Up to now I've been drawing inputs like x1 and x2 as variables floating to the left of the network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons - the input layer - to encode the inputs: This notation for input perceptrons, in which we have an output, but no inputs, is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs. Then the weighted sum \sum_j w_j x_j would always be zero, and so the perceptron would output 1 if b > 0, and 0 if b \leq 0. That is, the perceptron would simply output a fixed value, not the desired value (x1, in the example above). It's better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, x1, x2, ....
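For concreteness, here is a C sketch of that two-bit adder, with every NAND gate realized by the perceptron described above (the variable names are mine, purely for illustration):

    #include <stdio.h>

    /* One NAND gate, realized as a perceptron with weights -2, -2 and bias 3 */
    int nand( int a, int b ) {
        return ( -2.0 * a + -2.0 * b + 3.0 > 0.0 ) ? 1 : 0 ;
    }

    int main( void ) {
        int x1 = 1, x2 = 1 ;
        int t     = nand( x1, x2 ) ;
        int sum   = nand( nand( x1, t ), nand( x2, t ) ) ;  /* bitwise sum  x1 XOR x2 */
        int carry = nand( t, t ) ;                          /* carry bit    x1 AND x2 */
        printf( "sum %d, carry %d\n", sum, carry ) ;        /* prints: sum 0, carry 1 */
        return 0 ;
    }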
The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation. The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That's hardly big news! However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit. Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!): If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an 8 when it should be a 9. We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a 9. And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning. The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from 0 to 1. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your 9 might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn. We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn. Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons: Just like a perceptron, the sigmoid neuron has inputs, x1, x2, .... But instead of being just 0 or 1, these inputs can also take on any values between 0 and 1. So, for instance, 0.638... is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, w1, w2, ..., and an overall bias, b. But the output is not 0 or 1.
Instead, it's \sigma(w \cdot x + b), where \sigma is called the sigmoid function. Incidentally, \sigma is sometimes called the logistic function, and this new class of neurons called logistic neurons. It's useful to remember this terminology, since these terms are used by many people working with neural nets. However, we'll stick with the sigmoid terminology. It is defined by:

\begin{eqnarray}
  \sigma(z) \equiv \frac{1}{1+e^{-z}}.
\tag{3}\end{eqnarray}

To put it all a little more explicitly, the output of a sigmoid neuron with inputs x1, x2, ..., weights w1, w2, ..., and bias b is

\begin{eqnarray}
  \frac{1}{1+\exp(-\sum_j w_j x_j - b)}.
\tag{4}\end{eqnarray}

At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it. In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding. To understand the similarity to the perceptron model, suppose z \equiv w \cdot x + b is a large positive number. Then e^{-z} \approx 0 and so \sigma(z) \approx 1. In other words, when z = w \cdot x + b is large and positive, the output from the sigmoid neuron is approximately 1, just as it would have been for a perceptron. Suppose on the other hand that z = w \cdot x + b is very negative. Then e^{-z} \rightarrow \infty, and \sigma(z) \approx 0. So when z = w \cdot x + b is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when w \cdot x + b is of modest size that there's much deviation from the perceptron model. What about the algebraic form of \sigma? How can we understand that? In fact, the exact form of \sigma isn't so important - what really matters is the shape of the function when plotted. Here's the shape: This shape is a smoothed out version of a step function: If \sigma had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be 1 or 0 depending on whether w \cdot x + b was positive or negative. (Actually, when w \cdot x + b = 0 the perceptron outputs 0, while the step function outputs 1. So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.) By using the actual \sigma function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the \sigma function that is the crucial fact, not its detailed form. The smoothness of \sigma means that small changes \Delta w_j in the weights and \Delta b in the bias will produce a small change \Delta \mbox{output} in the output from the neuron. In fact, calculus tells us that \Delta \mbox{output} is well approximated by

\begin{eqnarray}
  \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b,
\tag{5}\end{eqnarray}

where the sum is over all the weights, w_j, and \partial \, \mbox{output} / \partial w_j and \partial \, \mbox{output} / \partial b denote partial derivatives of the output with respect to w_j and b, respectively. Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): \Delta \mbox{output} is a linear function of the changes \Delta w_j and \Delta b in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.
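Equation (5) is easy to test numerically. This C sketch compares the actual change in a single-input sigmoid neuron's output against the linear prediction; the parameter values are arbitrary, chosen only for illustration:

    #include <stdio.h>
    #include <math.h>

    double sigmoid( double z ) { return 1.0 / ( 1.0 + exp(-z) ) ; }

    int main( void ) {
        double w = 0.8, b = -0.3, x = 0.5 ;   /* arbitrary neuron parameters and input */
        double dw = 1e-4 ;                    /* a small change in the weight */
        double out  = sigmoid( w * x + b ) ;
        double out2 = sigmoid( (w + dw) * x + b ) ;
        /* Equation (5) with a single weight: Delta output ~ (d output / d w) * dw,
           where d output / d w = sigmoid'(z) * x = out * (1 - out) * x            */
        double predicted = out * (1.0 - out) * x * dw ;
        printf( "actual %.3e  predicted %.3e\n", out2 - out, predicted ) ;
        return 0 ;
    }

The two printed numbers agree to several significant figures, which is exactly the linearity the text describes.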
If it's the shape of \sigma which really matters, and not its exact form, then why use the particular form used for \sigma in Equation (3)? In fact, later in the book we will occasionally consider neurons where the output is f(w \cdot x + b) for some other activation function f(\cdot). The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using \sigma will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, \sigma is commonly used in work on neural nets, and is the activation function we'll use most often in this book. How should we interpret the output from a sigmoid neuron? Obviously, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don't just output 0 or 1. They can have as output any real number between 0 and 1, so values such as 0.173... and 0.689... are legitimate outputs. This can be useful, for example, if we want to use the output value to represent the average intensity of the pixels in an image input to a neural network. But sometimes it can be a nuisance. Suppose we want the output from the network to indicate either "the input image is a 9" or "the input image is not a 9". Obviously, it'd be easiest to do this if the output was a 0 or a 1, as in a perceptron. But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least 0.5 as indicating a 9, and any output less than 0.5 as indicating "not a 9". I'll always explicitly state when we're using such a convention, so it shouldn't cause any confusion.

Exercises
Sigmoid neurons simulating perceptrons, part I: Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, c > 0. Show that the behaviour of the network doesn't change.
Sigmoid neurons simulating perceptrons, part II: Suppose we have the same setup as the last problem - a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won't need the actual input value, we just need the input to have been fixed. Suppose the weights and biases are such that w \cdot x + b \neq 0 for the input x to any particular perceptron in the network. Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant c > 0. Show that in the limit as c \rightarrow \infty the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail when w \cdot x + b = 0 for one of the perceptrons?

In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network: As mentioned earlier, the leftmost layer in this network is called the input layer, and the neurons within the layer are called input neurons. The rightmost or output layer contains the output neurons, or, as in this case, a single output neuron. The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs.
The term hidden perhaps sounds a little mysterious - the first time I heard the term I thought it must have some deep philosophical or mathematical significance - but it really means nothing more than "not an input or an output". The network above has just a single hidden layer, but some networks have multiple hidden layers. For example, the following four-layer network has two hidden layers: Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence. The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a 9 or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a 64 by 64 greyscale image, then we'd have 4,096 = 64 x 64 input neurons, with the intensities scaled appropriately between 0 and 1. The output layer will contain just a single neuron, with output values of less than 0.5 indicating "input image is not a 9", and values greater than 0.5 indicating "input image is a 9". While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network. We'll meet several such design heuristics later in this book. Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called feedforward neural networks. This means there are no loops in the network - information is always fed forward, never fed back. If we did have loops, we'd end up with situations where the input to the \sigma function depended on the output. That'd be hard to make sense of, and so we don't allow such loops. However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously. Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks.
However, to limit our scope, in this book we're going to concentrate on the more widely used feedforward networks. Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image into six separate images. We humans solve this segmentation problem with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above is a 5. We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits. To recognize individual digits we will use a three-layer neural network: The input layer of the network contains neurons encoding the values of the input pixels. As discussed in the next section, our training data for the network will consist of many 28 by 28 pixel images of scanned handwritten digits, and so the input layer contains 784 = 28 x 28 neurons. For simplicity I've omitted most of the 784 input neurons in the diagram above. The input pixels are greyscale, with a value of 0.0 representing white, a value of 1.0 representing black, and in between values representing gradually darkening shades of grey. The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by n, and we'll experiment with different values for n. The example shown illustrates a small hidden layer, containing just n = 15 neurons. The output layer of the network contains 10 neurons. If the first neuron fires, i.e. has an output \approx 1, then that will indicate that the network thinks the digit is a 0. If the second neuron fires then that will indicate that the network thinks the digit is a 1. And so on. A little more precisely, we number the output neurons from 0 through 9, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number 6, then our network will guess that the input digit was a 6. And so on for the other output neurons. You might wonder why we use 10 output neurons. After all, the goal of the network is to tell us which digit (0, 1, 2, ..., 9) corresponds to the input image.
A seemingly natural way of doing that is to use just 4 output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to 0 or to 1. Four neurons are enough to encode the answer, since 2^4 = 16 is more than the 10 possible values for the input digit. Why should our network use 10 neurons instead? Isn't that inefficient? The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with 10 output neurons learns to recognize digits better than the network with 4 output neurons. But that leaves us wondering why using 10 output neurons works better. Is there some heuristic that would tell us in advance that we should use the 10-output encoding instead of the 4-output encoding? To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use 10 output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a 0. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present: It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present: As you may have guessed, these four images together make up the 0 image that we saw in the line of digits shown earlier: So if all four of these hidden neurons are firing then we can conclude that the digit is a 0. Of course, that's not the only sort of evidence we can use to conclude that the image was a 0 - we could legitimately get a 0 in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a 0. Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have 10 outputs from the network, rather than 4. If we had 4 outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output. Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only 4 output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures. Exercise: There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer.
Assume that the first 3 layers of neurons are such that the correct output in the third layer (i.e. the old output layer) has activation at least 0.99, and incorrect outputs have activation less than 0.01. Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from - a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States National Institute of Standards and Technology. Here's a few images from MNIST: As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set! The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training. We'll use the notation x to denote a training input. It'll be convenient to regard each training input x as a 28 x 28 = 784-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by y = y(x), where y is a 10-dimensional vector. For example, if a particular training image, x, depicts a 6, then y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T is the desired output from the network. Note that T here is the transpose operation, turning a row vector into an ordinary (column) vector. What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates y(x) for all training inputs x. To quantify how well we're achieving this goal we define a cost function (sometimes referred to as a loss or objective function; we use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks):

\begin{eqnarray}
  C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a \|^2.
\tag{6}\end{eqnarray}

Here, w denotes the collection of all weights in the network, b all the biases, n is the total number of training inputs, a is the vector of outputs from the network when x is input, and the sum is over all training inputs, x. Of course, the output a depends on x, w and b, but to keep the notation simple I haven't explicitly indicated this dependence. The notation \| v \| just denotes the usual length function for a vector v. We'll call C the quadratic cost function; it's also sometimes known as the mean squared error or just MSE.
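Read as code, Equation (6) is just an average of squared distances. A small C sketch (the array names and layout are mine, purely for illustration):

    /* C(w,b) = 1/(2n) * sum over the n training inputs x of ||y(x) - a(x)||^2 ;
       y and a hold, for each training input, a vector of dim output activations */
    double quadratic_cost( int n, int dim, const double y[n][dim], const double a[n][dim] ) {
        double cost = 0.0 ;
        for( int x = 0 ; x < n ; x++ ) {
            for( int k = 0 ; k < dim ; k++ ) {
                double d = y[x][k] - a[x][k] ;
                cost += d * d ;     /* squared length of y(x) - a(x), accumulated */
            }
        }
        return cost / ( 2.0 * n ) ;
    }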
Inspecting the form of the quadratic cost function, we see that C(w, b) is non-negative, since every term in the sum is non-negative. Furthermore, the cost C(w, b) becomes small, i.e. C(w, b) \approx 0, precisely when y(x) is approximately equal to the output, a, for all training inputs, x. So our training algorithm has done a good job if it can find weights and biases so that C(w, b) \approx 0. By contrast, it's not doing so well when C(w, b) is large - that would mean that y(x) is not close to the output a for a large number of inputs. So the aim of our training algorithm will be to minimize the cost C(w, b) as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent. Why introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy. Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6). Isn't this a rather ad hoc choice? Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function, and make some modifications. However, the quadratic cost function of Equation (6) works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now. Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function C(w, b). This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of w and b as weights and biases, the \sigma function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks. Okay, let's suppose we're trying to minimize some function, C(v). This could be any real-valued function of many variables, v = v1, v2, ....
Note that I've replaced the w and b notation by v to emphasize that this could be any function - we're not specifically thinking in the neural networks context any more. To minimize C(v) it helps to imagine C as a function of just two variables, which we'll call v1 and v2. What we'd like is to find where C achieves its global minimum. Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly too simple a function! A general function, C, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.

One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where C is an extremum. With some luck that might work when C is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!

(After asserting that we'll gain insight by imagining C as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine C as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)

Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of C - those derivatives would tell us everything we need to know about the local shape of the valley, and therefore how our ball should roll.

Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously - we're devising an algorithm to minimize C, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking.
So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let's think about what happens when we move the ball a small amount Δv1 in the v1 direction, and a small amount Δv2 in the v2 direction. Calculus tells us that C changes as follows:

\[ \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7} \]

We're going to find a way of choosing Δv1 and Δv2 so as to make ΔC negative; i.e. we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define Δv to be the vector of changes in v, Δv ≡ (Δv1, Δv2)^T, where T is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of C to be the vector of partial derivatives, \( \left( \partial C / \partial v_1, \partial C / \partial v_2 \right)^T \). We denote the gradient vector by ∇C, i.e.:

\[ \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{8} \]

In a moment we'll rewrite the change ΔC in terms of Δv and the gradient, ∇C. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the ∇C notation for the first time, people sometimes wonder how they should think about the ∇ symbol. What, exactly, does ∇ mean? In fact, it's perfectly fine to think of ∇C as a single mathematical object - the vector defined above - which happens to be written using two symbols. In this point of view, ∇ is just a piece of notational flag-waving, telling you "hey, ∇C is a gradient vector". There are more advanced points of view where ∇ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.

With these definitions, the expression (7) for ΔC can be rewritten as

\[ \Delta C \approx \nabla C \cdot \Delta v. \tag{9} \]

This equation helps explain why ∇C is called the gradient vector: ∇C relates changes in v to changes in C, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose Δv so as to make ΔC negative. In particular, suppose we choose

\[ \Delta v = - \eta \nabla C, \tag{10} \]

where η is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that ΔC ≈ −η ∇C · ∇C = −η ‖∇C‖². Because ‖∇C‖² ≥ 0, this guarantees that ΔC ≤ 0, i.e. C will always decrease, never increase, if we change v according to the prescription in (10). (Within, of course, the limits of the approximation in Equation (9).) This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10) to compute a value for Δv, then move the ball's position v by that amount:

\[ v \rightarrow v' = v - \eta \nabla C. \tag{11} \]
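As a quick numerical check of Equations (9) and (10) - a toy function of my own choosing, not one from the book - we can compare the actual change in C after one step with the predicted decrease −η‖∇C‖²:

import numpy as np

# Toy function of two variables, C(v) = v1^2 + 3*v2^2, and its gradient.
def C(v):
    return v[0] ** 2 + 3.0 * v[1] ** 2

def grad_C(v):
    return np.array([2.0 * v[0], 6.0 * v[1]])

v = np.array([1.0, 1.0])
eta = 0.01
dv = -eta * grad_C(v)                            # Equation (10)
print(C(v + dv) - C(v))                          # actual change: about -0.389
print(-eta * np.linalg.norm(grad_C(v)) ** 2)     # predicted decrease: -0.4

The two numbers agree to within the accuracy of the linear approximation, and both are negative, exactly as the argument above promises for a single application of the rule.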
Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing C until - we hope - we reach a global minimum.

Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient ∇C, and then to move in the opposite direction, "falling down" the slope of the valley. Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing Δv just says "go down, right now". That's still a pretty good rule for finding the minimum!

To make gradient descent work correctly, we need to choose the learning rate η to be small enough that Equation (9) is a good approximation. If we don't, we might end up with ΔC > 0, which obviously would not be good! At the same time, we don't want η to be too small, since that will make the changes Δv tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, η is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.

I've explained gradient descent when C is a function of just two variables. But, in fact, everything works just as well even when C is a function of many more variables. Suppose in particular that C is a function of m variables, v1, …, vm. Then the change ΔC in C produced by a small change Δv = (Δv1, …, Δvm)^T is

\[ \Delta C \approx \nabla C \cdot \Delta v, \tag{12} \]

where the gradient ∇C is the vector

\[ \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m} \right)^T. \tag{13} \]

Just as for the two variable case, we can choose

\[ \Delta v = -\eta \nabla C, \tag{14} \]

and we're guaranteed that our (approximate) expression (12) for ΔC will be negative. This gives us a way of following the gradient to a minimum, even when C is a function of many variables, by repeatedly applying the update rule

\[ v \rightarrow v' = v - \eta \nabla C. \tag{15} \]

You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position v in order to find a minimum of the function C. The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of C, a point we'll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.
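Here is the update rule (15) turned into a few lines of code - again a toy illustration of my own, minimizing a simple quadratic rather than a network cost:

import numpy as np

def gradient_descent(grad, v, eta, steps):
    """Repeatedly apply the update rule v -> v' = v - eta * grad(v), Equation (15)."""
    for _ in range(steps):
        v = v - eta * grad(v)
    return v

# Minimize C(v) = ||v||^2, whose gradient is 2v, from a random starting point.
v0 = np.random.randn(5)
v_min = gradient_descent(lambda v: 2.0 * v, v0, eta=0.1, steps=100)
print(v_min)   # very close to the minimum at the origin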
Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move Δv in position so as to decrease C as much as possible. This is equivalent to minimizing ΔC ≈ ∇C · Δv. We'll constrain the size of the move so that ‖Δv‖ = ε for some small fixed ε > 0. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases C as much as possible. It can be proved that the choice of Δv which minimizes ∇C · Δv is Δv = −η∇C, where η = ε / ‖∇C‖ is determined by the size constraint ‖Δv‖ = ε. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease C.

Prove the assertion of the last paragraph. Hint: if you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.

I explained gradient descent when C is a function of two variables, and when it's a function of more than two variables. What happens when C is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?

People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of C, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives \( \partial^2 C / \partial v_j \partial v_k \). If there are a million such vj variables then we'd need to compute something like a trillion (i.e. a million squared) second partial derivatives! (Actually, more like half a trillion, since \( \partial^2 C / \partial v_j \partial v_k = \partial^2 C / \partial v_k \partial v_j \). Still, you get the point.) That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.

How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights w_k and biases b_l which minimize the cost in Equation (6). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables vj. In other words, our "position" now has components w_k and b_l, and the gradient vector ∇C has corresponding components \( \partial C / \partial w_k \) and \( \partial C / \partial b_l \). Writing out the gradient descent update rule in terms of components, we have

\[ w_k \rightarrow w_k' = w_k - \eta \frac{\partial C}{\partial w_k} \tag{16} \]
\[ b_l \rightarrow b_l' = b_l - \eta \frac{\partial C}{\partial b_l}. \tag{17} \]

By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.

There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6). Notice that this cost function has the form \( C = \frac{1}{n} \sum_x C_x \), that is, it's an average over costs \( C_x \equiv \frac{1}{2} \| y(x) - a \|^2 \) for individual training examples. In practice, to compute the gradient ∇C we need to compute the gradients \( \nabla C_x \) separately for each training input, x, and then average them, \( \nabla C = \frac{1}{n} \sum_x \nabla C_x \). Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly. An idea called stochastic gradient descent can be used to speed up learning.
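Before looking at that speed-up, here is a sketch of what one exact step of the update rules (16) and (17) costs. The per-example gradients are supplied by a hypothetical helper, grad_for_example (a name I've invented for illustration; in the actual program that role is played by backpropagation):

def full_batch_step(weights, biases, training_data, eta, grad_for_example):
    """One gradient descent step using the exact gradient,
    nabla C = (1/n) * sum_x nabla C_x, averaged over all n training examples."""
    n = len(training_data)
    grad_w = [0.0 * w for w in weights]
    grad_b = [0.0 * b for b in biases]
    for x, y in training_data:   # the costly loop: every example, every step
        gw, gb = grad_for_example(weights, biases, x, y)
        grad_w = [acc + g for acc, g in zip(grad_w, gw)]
        grad_b = [acc + g for acc, g in zip(grad_b, gb)]
    # Equations (16) and (17), with the averaged gradients:
    weights = [w - (eta / n) * g for w, g in zip(weights, grad_w)]
    biases = [b - (eta / n) * g for b, g in zip(biases, grad_b)]
    return weights, biases

The loop over all n training examples, repeated at every single step, is exactly the bottleneck that stochastic gradient descent addresses.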
The idea is to estimate the gradient ∇C by computing \( \nabla C_x \) for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient ∇C, and this helps speed up gradient descent, and thus learning.

To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number m of randomly chosen training inputs. We'll label those random training inputs X1, X2, …, Xm, and refer to them as a mini-batch. Provided the sample size m is large enough we expect that the average value of the \( \nabla C_{X_j} \) will be roughly equal to the average over all \( \nabla C_x \), that is,

\[ \frac{1}{m} \sum_{j=1}^{m} \nabla C_{X_j} \approx \frac{1}{n} \sum_x \nabla C_x = \nabla C, \tag{18} \]

where the second sum is over the entire set of training data. Swapping sides we get

\[ \nabla C \approx \frac{1}{m} \sum_{j=1}^{m} \nabla C_{X_j}, \tag{19} \]

confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.

To connect this explicitly to learning in neural networks, suppose w_k and b_l denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those,

\[ w_k \rightarrow w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20} \]
\[ b_l \rightarrow b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21} \]

where the sums are over all the training examples Xj in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we've exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.

Incidentally, it's worth noting that conventions vary about scaling of the cost function and of mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor 1/n. People sometimes omit the 1/n, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn't known in advance. This can occur if more training data is being generated in real time, for instance. And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the 1/m term out the front of the sums. Conceptually this makes little difference, since it's equivalent to rescaling the learning rate η. But when doing detailed comparisons of different work it's worth watching out for.

We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size n = 60,000, as in MNIST, and choose a mini-batch size of (say) m = 10, this means we'll get a factor of 6,000 speedup in estimating the gradient! Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease C, and that means we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.
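The polling analogy is easy to check numerically. In this toy sketch of my own (with a made-up per-example cost, not a network cost), a 10-example mini-batch gives a gradient estimate close to the average over all 60,000 stand-in "training inputs":

import numpy as np

# Made-up per-example cost C_x(v) = ||v - x||^2 / 2, so that grad C_x = v - x.
np.random.seed(0)
xs = np.random.randn(60000, 2)                   # stand-ins for training inputs
v = np.array([1.0, -2.0])

full_grad = np.mean(v - xs, axis=0)              # exact gradient, all 60,000 examples
sample = xs[np.random.choice(len(xs), 10, replace=False)]
mini_batch_grad = np.mean(v - sample, axis=0)    # Equation (19), with m = 10

print(full_grad)         # close to [1.0, -2.0]
print(mini_batch_grad)   # noisy, but pointing in much the same direction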
An extreme version of gradient descent is to use a mini-batch size of just 1. That is, given a training input, x, we update our weights and biases according to the rules \( w_k \rightarrow w_k' = w_k - \eta \, \partial C_x / \partial w_k \) and \( b_l \rightarrow b_l' = b_l - \eta \, \partial C_x / \partial b_l \). Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, 20.

Let me conclude this section by discussing a point that sometimes bugs people new to gradient descent. In neural networks the cost C is, of course, a function of many variables - all the weights and biases - and so in some sense defines a surface in a very high-dimensional space. Some people get hung up thinking: "Hey, I have to be able to visualize all these extra dimensions." And they may start to worry: "I can't think in four dimensions, let alone five (or five million)." Is there some special ability they're missing, some ability that "real" supermathematicians have? Of course, the answer is no. Even most professional mathematicians can't visualize four dimensions especially well, if at all. The trick they use, instead, is to develop other ways of representing what's going on. That's exactly what we did above: we used an algebraic (rather than visual) representation of ΔC to figure out how to move so as to decrease C. People who are good at thinking in high dimensions have a mental library containing many different techniques along these lines; our algebraic trick is just one example. Those techniques may not have the simplicity we're accustomed to when visualizing three dimensions, but once you build up a library of such techniques, you can get pretty good at thinking in high dimensions. I won't go into more detail here, but if you're interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions. While some of the techniques discussed are quite complex, much of the best content is intuitive and accessible, and could be mastered by anyone.

Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is to get the MNIST data. If you're a git user then you can obtain the data by cloning the code repository for this book. If you don't use git then you can download the data and code here.

Incidentally, when I described the MNIST data earlier, I said it was split into 60,000 training images, and 10,000 test images. That's the official MNIST description. Actually, we're going to split the data a little differently. We'll leave the test images as is, but split the 60,000-image MNIST training set into two parts: a set of 50,000 images, which we'll use to train our neural network, and a separate 10,000-image validation set. We won't use the validation data in this chapter, but later in the book we'll find it useful in figuring out how to set certain hyper-parameters of the neural network - things like the learning rate, and so on, which aren't directly selected by our learning algorithm.
Although the validation data isn't part of the original MNIST specification, many people use MNIST in this fashion, and the use of validation data is common in neural networks. When I refer to the "MNIST training data" from now on, I'll be referring to our 50,000-image data set, not the original 60,000-image data set. As noted earlier, the MNIST data set is based on two data sets collected by NIST, the United States National Institute of Standards and Technology. To construct MNIST the NIST data sets were stripped down and put into a more convenient format by Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. See this link for more details. The data set in my repository is in a form that makes it easy to load and manipulate the MNIST data in Python. I obtained this particular form of the data from the LISA machine learning laboratory at the University of Montreal (link).

Apart from the MNIST data we also need a Python library called Numpy, for doing fast linear algebra. If you don't already have Numpy installed, you can get it here.

Let me explain the core features of the neural networks code, before giving a full listing, below. The centerpiece is a Network class, which we use to represent a neural network. Here's the code we use to initialize a Network object (see the sketch after this passage). In this code, the list sizes contains the number of neurons in the respective layers. So, for example, if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer, we'd do this with the code shown in that sketch. The biases and weights in the Network object are all initialized randomly, using the Numpy np.random.randn function to generate Gaussian distributions with mean 0 and standard deviation 1. This random initialization gives our stochastic gradient descent algorithm a place to start from. In later chapters we'll find better ways of initializing the weights and biases, but this will do for now. Note that the Network initialization code assumes that the first layer of neurons is an input layer, and omits to set any biases for those neurons, since biases are only ever used in computing the outputs from later layers.

Note also that the biases and weights are stored as lists of Numpy matrices. So, for example, net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons. (It's not the first and second layers, since Python's list indexing starts at 0.) Since net.weights[1] is rather verbose, let's just denote that matrix w. It's a matrix such that \( w_{jk} \) is the weight for the connection between the k-th neuron in the second layer, and the j-th neuron in the third layer. This ordering of the j and k indices may seem strange - surely it'd make more sense to swap the j and k indices around? The big advantage of using this ordering is that it means that the vector of activations of the third layer of neurons is:

\[ a' = \sigma(w a + b). \tag{22} \]

There's quite a bit going on in this equation, so let's unpack it piece by piece. a is the vector of activations of the second layer of neurons. To obtain a' we multiply a by the weight matrix w, and add the vector b of biases. We then apply the function σ elementwise to every entry in the vector wa + b. (This is called vectorizing the function σ.) It's easy to verify that Equation (22) gives the same result as our earlier rule, Equation (4), for computing the output of a sigmoid neuron.
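The two code listings referred to in this passage did not survive in this copy of the text. Here is a sketch consistent with the description above; the published network.py in the book's repository looks essentially like this, but treat the details here as a reconstruction:

import numpy as np

class Network(object):

    def __init__(self, sizes):
        """`sizes` lists the number of neurons in each layer, e.g. [2, 3, 1]."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        # No biases for the input layer; one (y, 1) column vector per later layer.
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        # Each weight matrix has shape (y, x): entry (j, k) connects the k-th
        # neuron in one layer to the j-th neuron in the next.
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

# A network with 2 input neurons, 3 hidden neurons, and 1 output neuron:
net = Network([2, 3, 1])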
Write out Equation (22) in component form, and verify that it gives the same result as the rule (4) for computing the output of a sigmoid neuron.

With all this in mind, it's easy to write code computing the output from a Network instance. We begin by defining the sigmoid function. Note that when the input z is a vector or Numpy array, Numpy automatically applies the function sigmoid elementwise, that is, in vectorized form. We then add a feedforward method to the Network class, which, given an input a for the network, returns the corresponding output. (It is assumed that the input a is an (n, 1) Numpy ndarray, not an (n,) vector. Here, n is the number of inputs to the network. If you try to use an (n,) vector as input you'll get strange results. Although using an (n,) vector appears the more natural choice, using an (n, 1) ndarray makes it particularly easy to modify the code to feedforward multiple inputs at once, and that is sometimes convenient.) All the method does is apply Equation (22) for each layer.

Of course, the main thing we want our Network objects to do is to learn. To that end we'll give them an SGD method which implements stochastic gradient descent. It's a little mysterious in a few places, but I'll break it down below. (A condensed sketch of all these methods follows this passage.) The training_data is a list of tuples (x, y) representing the training inputs and corresponding desired outputs. The variables epochs and mini_batch_size are what you'd expect - the number of epochs to train for, and the size of the mini-batches to use when sampling. eta is the learning rate, η. If the optional argument test_data is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress. This is useful for tracking progress, but slows things down substantially.

The code works as follows. In each epoch, it starts by randomly shuffling the training data, and then partitions it into mini-batches of the appropriate size. This is an easy way of sampling randomly from the training data. Then for each mini_batch we apply a single step of gradient descent. This is done by the code self.update_mini_batch(mini_batch, eta), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in mini_batch. In the update_mini_batch method, most of the work is done by a single line, which invokes something called the backpropagation algorithm - a fast way of computing the gradient of the cost function. So update_mini_batch works simply by computing these gradients for every training example in the mini_batch, and then updating self.weights and self.biases appropriately.

I'm not going to show the code for self.backprop right now. We'll study how backpropagation works in the next chapter, including the code for self.backprop. For now, just assume that it behaves as claimed, returning the appropriate gradient for the cost associated to the training example x.

Let's look at the full program, including the documentation strings, which I omitted above. Apart from self.backprop the program is self-explanatory - all the heavy lifting is done in self.SGD and self.update_mini_batch, which we've already discussed. The self.backprop method makes use of a few extra functions to help in computing the gradient, namely sigmoid_prime, which computes the derivative of the σ function, and self.cost_derivative, which I won't describe here.
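Again, the listings themselves are missing from this copy. The following condensed sketch matches the surrounding description; self.backprop is deliberately left undefined, exactly as in the text, and the evaluate method (assumed here to count test inputs whose most-activated output neuron matches the integer label) is my own summary of what the full listing does:

import random
import numpy as np

def sigmoid(z):
    """The sigmoid function; Numpy applies it elementwise to arrays."""
    return 1.0 / (1.0 + np.exp(-z))

class Network(object):

    def __init__(self, sizes):
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        """Return the network output for input `a`, an (n, 1) ndarray."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a) + b)   # Equation (22), layer by layer
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the network using mini-batch stochastic gradient descent."""
        n = len(training_data)
        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [training_data[k:k + mini_batch_size]
                            for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print("Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), len(test_data)))

    def update_mini_batch(self, mini_batch, eta):
        """One step of the update rules (20) and (21), gradients via backprop."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)  # next chapter
            nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w - (eta / len(mini_batch)) * nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b - (eta / len(mini_batch)) * nb
                       for b, nb in zip(self.biases, nabla_b)]

    def evaluate(self, test_data):
        """Count test inputs whose highest-activation output matches the label."""
        results = [(np.argmax(self.feedforward(x)), y) for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in results)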
You can get the gist of these (and perhaps the details) just by looking at the code and documentation strings. We'll look at them in detail in the next chapter. Note that while the program appears lengthy, much of the code is documentation strings intended to make the code easy to understand. In fact, the program contains just 74 lines of non-whitespace, non-comment code. All the code may be found on GitHub here.

How well does the program recognize handwritten digits? Well, let's start by loading in the MNIST data. I'll do this using a little helper program, mnist_loader.py, to be described below. We execute the following commands in a Python shell, as sketched below. (Of course, this could also be done in a separate Python program, but if you're following along it's probably easiest to do in a Python shell.) After loading the MNIST data, we'll set up a Network with 30 hidden neurons. We do this after importing the Python program listed above, which is named network. Finally, we'll use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of η = 3.0.

Note that if you're running the code as you read along, it will take some time to execute - for a typical machine (as of 2015) it will likely take a few minutes to run. I suggest you set things running, continue to read, and periodically check the output from the code. If you're in a rush you can speed things up by decreasing the number of epochs, by decreasing the number of hidden neurons, or by using only part of the training data. Note that production code would be much, much faster: these Python scripts are intended to help you understand how neural nets work, not to be high-performance code! And, of course, once we've trained a network it can be run very quickly indeed, on almost any computing platform. For example, once we've learned a good set of weights and biases for a network, it can easily be ported to run in Javascript in a web browser, or as a native app on a mobile device.

In any case, here is a partial transcript of the output of one training run of the neural network. The transcript shows the number of test images correctly recognized by the neural network after each epoch of training. As you can see, after just a single epoch this has reached 9,129 out of 10,000, and the number continues to grow. That is, the trained network gives us a classification rate of about 95 percent - 95.42 percent at its peak ("Epoch 28")! That's quite encouraging as a first attempt. I should warn you, however, that if you run the code then your results are not necessarily going to be quite the same as mine, since we'll be initializing our network using (different) random weights and biases. To generate results in this chapter I've taken best-of-three runs.

Let's rerun the above experiment, changing the number of hidden neurons to 100. As was the case earlier, if you're running the code as you read along, you should be warned that it takes quite a while to execute (on my machine this experiment takes tens of seconds for each training epoch), so it's wise to continue reading in parallel while the code executes. Sure enough, this improves the results to 96.59 percent. At least in this case, using more hidden neurons helps us get better results. (Reader feedback indicates quite some variation in results for this experiment, and some training runs give results quite a bit worse. Using the techniques introduced in chapter 3 will greatly reduce the variation in performance across different training runs for our networks.)
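Reconstructed as a sketch, the shell session described above looks like this. The helper name load_data_wrapper matches the book's repository, but in this copy it should be treated as an assumption; only the first epoch's figure, quoted in the text, is shown:

>>> import mnist_loader
>>> training_data, validation_data, test_data = \
...     mnist_loader.load_data_wrapper()
>>> import network
>>> net = network.Network([784, 30, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
Epoch 0: 9129 / 10000
...

And the 100-hidden-neuron rerun simply changes the middle layer size:

>>> net = network.Network([784, 100, 10])
>>> net.SGD(training_data, 30, 10, 3.0, test_data=test_data)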
Of course, to obtain these accuracies I had to make specific choices for the number of epochs of training, the mini-batch size, and the learning rate, η. As I mentioned above, these are known as hyper-parameters for our neural network, in order to distinguish them from the parameters (weights and biases) learnt by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. Suppose, for example, that we'd chosen the learning rate to be η = 0.001. The results are much less encouraging. However, you can see that the performance of the network is getting slowly better over time. That suggests increasing the learning rate, say to η = 0.01. If we do that, we get better results, which suggests increasing the learning rate again. (If making a change improves things, try doing more!) If we do that several times over, we'll end up with a learning rate of something like η = 1.0 (and perhaps fine tune to 3.0), which is close to our earlier experiments. So even though we initially made a poor choice of hyper-parameters, we at least got enough information to help us improve our choice of hyper-parameters.

In general, debugging a neural network can be challenging. This is especially true when the initial choice of hyper-parameters produces results no better than random noise. Suppose we try the successful 30 hidden neuron network architecture from earlier, but with the learning rate changed to η = 100.0. At this point we've actually gone too far, and the learning rate is too high. Now imagine that we were coming to this problem for the first time. Of course, we know from our earlier experiments that the right thing to do is to decrease the learning rate. But if we were coming to this problem for the first time then there wouldn't be much in the output to guide us on what to do. We might worry not only about the learning rate, but about every other aspect of our neural network. We might wonder if we've initialized the weights and biases in a way that makes it hard for the network to learn? Or maybe we don't have enough training data to get meaningful learning? Perhaps we haven't run for enough epochs? Or maybe it's impossible for a neural network with this architecture to learn to recognize handwritten digits? Maybe the learning rate is too low? Or, maybe, the learning rate is too high? When you're coming to a problem for the first time, you're not always sure.

The lesson to take away from this is that debugging a neural network is not trivial, and, just as for ordinary programming, there is an art to it. You need to learn that art of debugging in order to get good results from neural networks. More generally, we need to develop heuristics for choosing good hyper-parameters and a good architecture. We'll discuss all these at length through the book, including how I chose the hyper-parameters above.

Try creating a network with just two layers - an input and an output layer, no hidden layer - with 784 and 10 neurons, respectively. Train the network using stochastic gradient descent. What classification accuracy can you achieve?

Earlier, I skipped over the details of how the MNIST data is loaded. It's pretty straightforward. For completeness, here's the code. The data structures used to store the MNIST data are described in the documentation strings - it's straightforward stuff, tuples and lists of Numpy ndarray objects (think of them as vectors if you're not familiar with ndarrays). I said above that our program gets pretty good results.
What does that mean? Good compared to what? It's informative to have some simple (non-neural-network) baseline tests to compare against, to understand what it means to perform well. The simplest baseline of all, of course, is to randomly guess the digit. That'll be right about ten percent of the time. We're doing much better than that!

What about a less trivial baseline? Let's try an extremely simple idea: we'll look at how dark an image is. For instance, an image of a 2 will typically be quite a bit darker than an image of a 1, just because more pixels are blackened out. This suggests using the training data to compute average darknesses for each digit, 0, 1, 2, …, 9. When presented with a new image, we compute how dark the image is, and then guess that it's whichever digit has the closest average darkness. This is a simple procedure, and is easy to code up, so I won't explicitly write out the full code - if you're interested it's in the GitHub repository (a condensed sketch appears below). But it's a big improvement over random guessing, getting 2,225 of the 10,000 test images correct, i.e. 22.25 percent accuracy.

It's not difficult to find other ideas which achieve accuracies in the 20 to 50 percent range. If you work a bit harder you can get up over 50 percent. But to get much higher accuracies it helps to use established machine learning algorithms. Let's try using one of the best known algorithms, the support vector machine or SVM. If you're not familiar with SVMs, not to worry, we're not going to need to understand the details of how SVMs work. Instead, we'll use a Python library called scikit-learn, which provides a simple Python interface to a fast C-based library for SVMs known as LIBSVM.

If we run scikit-learn's SVM classifier using the default settings, then it gets 9,435 of 10,000 test images correct. (The code is available here.) That's a big improvement over our naive approach of classifying an image based on how dark it is. Indeed, it means that the SVM is performing roughly as well as our neural networks, just a little worse. In later chapters we'll introduce new techniques that enable us to improve our neural networks so that they perform much better than the SVM.

That's not the end of the story, however. The 9,435 of 10,000 result is for scikit-learn's default settings for SVMs. SVMs have a number of tunable parameters, and it's possible to search for parameters which improve this out-of-the-box performance. I won't explicitly do this search, but instead refer you to this blog post by Andreas Mueller if you'd like to know more. Mueller shows that with some work optimizing the SVM's parameters it's possible to get the performance up above 98.5 percent accuracy. In other words, a well-tuned SVM only makes an error on about one digit in 70. That's pretty good! Can neural networks do better?

In fact, they can. At present, well-designed neural networks outperform every other technique for solving MNIST, including SVMs. The current (2013) record is classifying 9,979 of 10,000 images correctly. This was done by Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. We'll see most of the techniques they used later in the book.
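Returning for a moment to the darkness baseline promised above, here is a condensed sketch of my own (the repository's version differs in detail). It assumes (image, digit) pairs where image is a flattened greyscale array and digit is an integer label:

import numpy as np

def avg_darknesses(training_data):
    """Average total darkness of the training images for each digit 0..9."""
    totals = np.zeros(10)
    counts = np.zeros(10)
    for image, digit in training_data:
        totals[digit] += np.sum(image)
        counts[digit] += 1
    return totals / counts

def guess_digit(image, darknesses):
    """Guess whichever digit has the closest average darkness."""
    return int(np.argmin(np.abs(darknesses - np.sum(image))))

The SVM baseline is similarly compact: with scikit-learn, an svm.SVC() classifier with default settings, fit on the flattened training images and their labels, yields the 9,435 of 10,000 figure quoted above.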
At that level of performance - 9,979 of 10,000 images correctly classified - the network is close to human-equivalent, and is arguably better, since quite a few of the MNIST images are difficult even for humans to recognize with confidence. I trust you'll agree that those are tough to classify! With images like these in the MNIST data set it's remarkable that neural networks can accurately classify all but 21 of the 10,000 test images. Usually, when programming we believe that solving a complicated problem like recognizing the MNIST digits requires a sophisticated algorithm. But even the neural networks in the Wan et al paper just mentioned involve quite simple algorithms, variations on the algorithm we've seen in this chapter. All the complexity is learned, automatically, from the training data. In some sense, the moral of both our results and those in more sophisticated papers, is that for some problems: sophisticated algorithm ≤ simple learning algorithm + good training data.

While our neural network gives impressive performance, that performance is somewhat mysterious. The weights and biases in the network were discovered automatically. And that means we don't immediately have an explanation of how the network does what it does. Can we find some way to understand the principles by which our network is classifying handwritten digits? And, given such principles, can we do better?

To put these questions more starkly, suppose that a few decades hence neural networks lead to artificial intelligence (AI). Will we understand how such intelligent networks work? Perhaps the networks will be opaque to us, with weights and biases we don't understand, because they've been learned automatically. In the early days of AI research people hoped that the effort to build an AI would also help us understand the principles behind intelligence and, maybe, the functioning of the human brain. But perhaps the outcome will be that we end up understanding neither the brain nor how artificial intelligence works!

To address these questions, let's think back to the interpretation of artificial neurons that I gave at the start of the chapter, as a means of weighing evidence. Suppose we want to determine whether an image shows a human face or not. (Image credits: 1. Ester Inbar; 2. Unknown; 3. NASA, ESA, G. Illingworth, D. Magee, and P. Oesch (University of California, Santa Cruz), R. Bouwens (Leiden University), and the HUDF09 Team.)

We could attack this problem the same way we attacked handwriting recognition - by using the pixels in the image as input to a neural network, with the output from the network a single neuron indicating either "Yes, it's a face" or "No, it's not a face". Let's suppose we do this, but that we're not using a learning algorithm. Instead, we're going to try to design a network by hand, choosing appropriate weights and biases. How might we go about it? Forgetting neural networks entirely for the moment, a heuristic we could use is to decompose the problem into sub-problems: does the image have an eye in the top left? Does it have an eye in the top right? Does it have a nose in the middle? Does it have a mouth in the bottom middle? Is there hair on top? And so on.

If the answers to several of these questions are "yes", or even just "probably yes", then we'd conclude that the image is likely to be a face. Conversely, if the answers to most of the questions are "no", then the image probably isn't a face. Of course, this is just a rough heuristic, and it suffers from many deficiencies. Maybe the person is bald, so they have no hair.
Maybe we can only see part of the face, or the face is at an angle, so some of the facial features are obscured. Still, the heuristic suggests that if we can solve the sub-problems using neural networks, then perhaps we can build a neural network for face-detection, by combining the networks for the sub-problems. Here's a possible architecture, with rectangles denoting the sub-networks. Note that this isn't intended as a realistic approach to solving the face-detection problem; rather, it's to help us build intuition about how networks function.

It's also plausible that the sub-networks can be decomposed. Suppose we're considering the question: "Is there an eye in the top left?" This can be decomposed into questions such as: "Is there an eyebrow?"; "Are there eyelashes?"; "Is there an iris?"; and so on. Of course, these questions should really include positional information, as well - "Is the eyebrow in the top left, and above the iris?", that kind of thing - but let's keep it simple. The network to answer the question "Is there an eye in the top left?" can now be decomposed, and those questions too can be broken down, further and further through multiple layers. Ultimately, we'll be working with sub-networks that answer questions so simple they can easily be answered at the level of single pixels. Those questions might, for example, be about the presence or absence of very simple shapes at particular points in the image. Such questions can be answered by single neurons connected to the raw pixels in the image.

The end result is a network which breaks down a very complicated question - does this image show a face or not - into very simple questions answerable at the level of single pixels. It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts. Networks with this kind of many-layer structure - two or more hidden layers - are called deep neural networks.

Of course, I haven't said how to do this recursive decomposition into sub-networks. It certainly isn't practical to hand-design the weights and biases in the network. Instead, we'd like to use learning algorithms so that the network can automatically learn the weights and biases - and thus, the hierarchy of concepts - from training data. Researchers in the 1980s and 1990s tried using stochastic gradient descent and backpropagation to train deep networks. Unfortunately, except for a few special architectures, they didn't have much luck. The networks would learn, but very slowly, and in practice often too slowly to be useful.

Since 2006, a set of techniques has been developed that enable learning in deep neural nets. These deep learning techniques are based on stochastic gradient descent and backpropagation, but also introduce new ideas. These techniques have enabled much deeper (and larger) networks to be trained - people now routinely train networks with 5 to 10 hidden layers. And, it turns out that these perform far better on many problems than shallow neural networks, i.e. networks with just a single hidden layer. The reason, of course, is the ability of deep nets to build up a complex hierarchy of concepts. It's a bit like the way conventional programming languages use modular design and ideas about abstraction to enable the creation of complex computer programs.
Comparing a deep network to a shallow network is a bit like comparing a programming language with the ability to make function calls to a stripped-down language with no ability to make such calls. Abstraction takes a different form in neural networks than it does in conventional programming, but it's just as important.